Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-03-20 Thread Zheng Lin Edwin Yeo
Hi Paul,

I would like to check whether there is any difference in performance between
the following two patterns:

(\n\W*){2,}

[ \t\x0b\f]*\r?\n

Regards,
Edwin
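
A rough way to compare the two patterns is to time them on representative text. The sketch below uses Python's `re` module as a stand-in for the Java regex engine that Solr uses, so the numbers it prints are only indicative. The sample text, the function names and the repeat counts are assumptions made for illustration. The two patterns also do different jobs: the first collapses a whole run of line breaks in one pass, while the second converts one line break at a time and relies on a second processor to collapse the result.

```python
import re
import timeit

# re.ASCII keeps \W close to Java's default, non-Unicode definition.
COLLAPSE_RUNS = re.compile(r"(\n\W*){2,}", re.ASCII)
SINGLE_BREAK = re.compile(r"[ \t\x0b\f]*\r?\n")

# Hypothetical sample: short paragraphs separated by runs of blank lines.
SAMPLE = "Dear Sir,  \n\n \n \n\n I am terminating\n" * 200


def collapse_runs(text):
    """One pass: replace each run of line breaks with two <br>."""
    return COLLAPSE_RUNS.sub("<br><br>", text)


def single_breaks(text):
    """One pass: replace every line break with one <br>."""
    return SINGLE_BREAK.sub("<br>", text)


def best_time(func, repeat=5, number=20):
    """Best wall-clock time in seconds for `number` calls on SAMPLE."""
    return min(timeit.repeat(lambda: func(SAMPLE), repeat=repeat, number=number))


if __name__ == "__main__":
    print("collapse runs :", best_time(collapse_runs))
    print("single breaks :", best_time(single_breaks))
```

Timings taken inside Solr itself, on real documents, would be the more reliable guide; this only shows the shape of such a measurement.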


Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-03-13 Thread Zheng Lin Edwin Yeo
Hi Paul,

Thanks for your reply.

So far we have not found any cases of punctuation being removed.

Our aim is to reduce each run of line breaks (\n) to 2 <br>, and the runs are
unlikely to have any punctuation in between.

Do you know whether the pattern (\n\W*){2,} that we are using is OK?
Or would the other pattern, [ \t\x0b\f]*\r?\n, be better?

Regards,
Edwin


AW: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-03-13 Thread paul.dodd
Hi Edwin,
With \W you will also replace non-word characters such as punctuation. If
that's OK, fine. Otherwise you need to identify the white space characters that
are causing the problem.
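
The difference between \W and \s can be seen on a small input. This is a minimal sketch in Python, with `re.ASCII` used so that \s and \W stay close to Java's default definitions; the sample text with a "-- " signature delimiter is hypothetical and not taken from the indexed documents.

```python
import re

# re.ASCII keeps \s and \W close to Java's default (non-Unicode) classes.
WITH_NON_WORD = re.compile(r"(\n\W*){2,}", re.ASCII)
WITH_WHITESPACE = re.compile(r"(\n\s*){2,}", re.ASCII)

# Hypothetical input: a blank line followed by a "-- " signature delimiter.
text = "Dear Sir,\n\n-- \nJohn"

# \W* also swallows the dashes, so the delimiter disappears.
non_word_result = WITH_NON_WORD.sub("<br><br>", text)

# \s* stops at the first dash, so the delimiter survives.
whitespace_result = WITH_WHITESPACE.sub("<br><br>", text)

print(repr(non_word_result))
print(repr(whitespace_result))
```

Whether losing such characters matters depends on the content being indexed.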

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-03-12 Thread Zheng Lin Edwin Yeo
Hi,

We have managed to resolve the issue by changing \s to \W. The reason could be
that some of the characters between the line breaks are other white-space
characters rather than plain spaces. In our case \s did not remove those,
but \W removes them as well.

We have used this config, and it works.


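<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\n\W*){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\n\W*){1,}</str>
   <str name="replacement">&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>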


Regards,
Edwin
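
One possible explanation for \s failing where \W works, not confirmed anywhere in this thread, is a character such as the non-breaking space (U+00A0) between the line breaks. By default Java's \s covers only space, tab, \n, \x0B, \f and \r, so it does not match U+00A0, while \W matches anything that is not a letter, digit or underscore. The sketch below reproduces that behaviour in Python with `re.ASCII`; the sample text is hypothetical.

```python
import re

NBSP = "\u00a0"  # non-breaking space, an assumed culprit

# re.ASCII restricts \s to [ \t\n\r\f\v], similar to Java's default \s.
WITH_WHITESPACE = re.compile(r"(\n\s*){2,}", re.ASCII)
WITH_NON_WORD = re.compile(r"(\n\W*){2,}", re.ASCII)

text = f"exalted\n{NBSP}\n{NBSP}\nPsalm 89:17"

# The NBSP breaks every run of "\n\s*", so nothing is replaced.
whitespace_result = WITH_WHITESPACE.sub("<br><br>", text)

# \W treats the NBSP as a non-word character and spans the whole run.
non_word_result = WITH_NON_WORD.sub("<br><br>", text)

print(repr(whitespace_result))
print(repr(non_word_result))
```

If this is indeed the cause, a narrower class that names the offending character explicitly would avoid the side effect of \W on punctuation.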


Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-03-11 Thread Zheng Lin Edwin Yeo
Hi,

Has anyone else faced the same issue before?
So far, none of the regex patterns that we have tried in this thread has
resolved the issue.

Regards,
Edwin


Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-03-07 Thread Zheng Lin Edwin Yeo
Hi Paul,

Sorry, I realized there is an extra ']' in the pattern provided, which is
why there are so many <br> in the output.

The output is exactly the same as previously (previous index result) if we
remove the extra ']', as shown in the configuration below.

 
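<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">[ \t\x0b\f]*\r?\n</str>
   <str name="replacement">&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(&lt;br&gt;[ \t\x0b\f]*){3,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>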
 

Regards,
Edwin
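
The effect of the extra ']' can be checked in isolation. With the typo, the character class closes at the first ']' and the second ']' becomes a literal that the `*` applies to, so every <br> must be followed by exactly one white-space character. The sketch below uses Python's `re`, which parses these two patterns the same way; the sample strings are made up for illustration.

```python
import re

# With the extra ']': <br>, then exactly one of [ \t\x0b\f], then zero or more ']'.
TYPO = re.compile(r"(<br>[ \t\x0b\f]]*){3,}")
# Without it: <br>, then zero or more of [ \t\x0b\f].
FIXED = re.compile(r"(<br>[ \t\x0b\f]*){3,}")

adjacent = "<br><br><br><br>"
spaced = "<br> <br> <br> <br>"

# Adjacent <br> never match the typo pattern, so they are left untouched.
typo_adjacent = TYPO.sub("<br><br>", adjacent)
# Only the first three "<br> " units match; the trailing <br> is left over.
typo_spaced = TYPO.sub("<br><br>", spaced)

fixed_adjacent = FIXED.sub("<br><br>", adjacent)
fixed_spaced = FIXED.sub("<br><br>", spaced)

print(typo_adjacent, typo_spaced, fixed_adjacent, fixed_spaced)
```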




Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-03-07 Thread Zheng Lin Edwin Yeo
Hi Paul,

Thanks for the reply.

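For the 2nd pattern, if we use the pattern
<str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>, as in the
configuration below:

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">[ \t\x0b\f]*\r?\n</str>
   <str name="replacement">&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>

It is not able to reduce all runs of more than 3 <br> to 2 <br>.

We end up with many <br> in the output, like the example below: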

 http://www.concorded.com/

On Tue, Dec 18, 2018


Regards,
Edwin





AW: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-03-07 Thread paul.dodd
Hi Edwin



I can’t understand why the pattern is not working or where the spaces between
the <br> are coming from. However, it should be possible to allow for spaces
between the <br> in the second match pattern, i.e. 2nd pattern

(&lt;br&gt;[ \t\x0b\f]]*){3,}



/Paul




Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-03-06 Thread Zheng Lin Edwin Yeo
Hi Paul,

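I have tried setting the first match pattern to [ \t\x0b\f]*\r?\n, as in the
configuration below:

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">[ \t\x0b\f]*\r?\n</str>
   <str name="replacement">&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(&lt;br&gt;){3,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>

However, the result is still the same as before (previous index results),
with the 4 <br>.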

Regards,
Edwin


On Wed, 6 Mar 2019 at 18:23,  wrote:

> Hi Edwin
>
>
>
> You are correct re the 2nd pattern – my bad. Looking at the 4 <br>, it’s
> actually the sequence «  »? So perhaps the first match
> pattern could be [ \t\x0b\f]*\r?\n
>
>
>
> i.e. [space tab vertical-tab formfeed]
>
>
>
> Regards,
>
> Paul
>
>
>
> Sent from Mail for Windows 10
>
>
>
> From: Zheng Lin Edwin Yeo
> Sent: Wednesday, 6 March 2019 07:44
> To: solr-user@lucene.apache.org
> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>
>
>
> Hi Paul,
>
> I have modified the second pattern to be (&lt;br&gt;){3,} instead of
> (&lt;br&gt;&lt;br&gt;){3,}. The pattern (&lt;br&gt;&lt;br&gt;){3,}
> actually looks for 6 or more <br> instead of 3 <br>, as we have put
> the <br> two times in the pattern. That is why there are more
> <br> in the result: cases with fewer than 6 <br> are not being
> replaced, so we ended up with up to 5 <br> in the index.
>
> Modified configuration:
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(&lt;br&gt;){3,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
>
> This will bring us back to the result of the previous index content,
> meaning the issue of having the 4 <br> is still there.
>
> Regards,
> Edwin
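
The counting argument in the quoted message above can be checked directly: a group that already contains two <br>, repeated at least three times, needs six of them. A minimal sketch in Python follows; the helper name `collapse` is made up for illustration.

```python
import re

DOUBLED = re.compile(r"(<br><br>){3,}")  # needs at least 6 <br>
SINGLE = re.compile(r"(<br>){3,}")       # needs at least 3 <br>


def collapse(pattern, count):
    """Replace matching runs in a string of `count` adjacent <br> with two."""
    return pattern.sub("<br><br>", "<br>" * count)


for count in range(3, 8):
    print(count, collapse(DOUBLED, count), collapse(SINGLE, count))
```

With the doubled group, runs of 3 to 5 <br> are left alone, and a run of 7 leaves one <br> over, because only whole pairs are consumed.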
>
> On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo 
> wrote:
>
> > Hi Paul,
> >
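> > Further to my previous email, in which there was an extra "}" in the
> > configuration, I have changed to the configuration below, based on
> > your suggestion.
> >
> > <processor class="solr.RegexReplaceProcessorFactory">
> >    <str name="fieldName">content</str>
> >    <str name="pattern">[ \t]*\r?\n</str>
> >    <str name="replacement">&lt;br&gt;</str>
> >    <bool name="literalReplacement">true</bool>
> > </processor>
> > <processor class="solr.RegexReplaceProcessorFactory">
> >    <str name="fieldName">content</str>
> >    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >    <bool name="literalReplacement">true</bool>
> > </processor>
> >
> > However, the result that I get still has more than 2 <br>. In fact, the
> > result has become worse, as you can see from the comparison below.
> >
> > Example 1: A sentence on which the regex pattern used to work correctly.
> > With the latest pattern, it has now changed from 2 <br> to 5 <br>,
> > which is wrong.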
> > *Original content in EML file:*
> > Dear Sir,
> >
> >
> > I am terminating
> > *Original content:*Dear Sir,  \n\n \n \n\n I am terminating
> > *Previous Index content: *Dear Sir,  I am terminating
> > *Current Index content*:   Dear Sir,  I am
> terminating
> >
> > Example 2: The sentence that the above regex pattern is partially working
> > (as you can see, instead of 2 , there are 4 )
> > *Original content in EML file:*
> >
> > *exalted*
> >
> > *Psalm 89:17*
> >
> >
> > 3 Choa Chu Kang Avenue 4
> > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
> > Chu Kang Avenue 4, Singapore
> > *Previous Index content: *exalted  Psalm 89:17   
> > 3 Choa Chu Kang Avenue 4, Singapore
> > *Current Index content*:Psalm 89:173
> > Choa Chu Kang Avenue 3, Singapor4
> >
> > Example 3: The sentence that the above regex pattern is partially working
> > (as you can see, instead of 2 , there are 4 ). For the latest
> code,
> > there are now 5 
> > *Original content in EML file:*
> >
> > http://www.concorded.com/
> >
> >
> >
> >
> >
> >
> >
> >
> > On Tue, Dec 18, 2018 at 10:07 AM
> > *Original content:* http://www.concorded.com/   \n\n   \n\n \n \n\n \n\n
> > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at
> > 10:07 AM
> > *Previous Index content: *http://www.concorded.com/   
> > On Tue, Dec 18, 2018 at 10:07 AM
> > *Current Index content:* http://www.concorded.com/  
> > On Tue, Dec 18, 2018 at 10:07 AM
> >
> >
> > Regards,
> > Edwin
> >
> > On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo 
> > wrote:
> >
> >> Hi Paul,
> >>
> >> Thank you for the reply.
> >>
> >> I have tried to add the following configuration according to your
> >> suggestion:
> >>
> >> 
> >>content
> >>[ \t]*\r?\n}
> >>br
> >>true
> >> 
> >>
> >> 
> >>content
> >>(brbr

AW: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-03-06 Thread paul.dodd
Hi Edwin



You are correct re the 2nd pattern; my bad. Looking at the 4 <br>, it’s
actually the sequence «  »? So perhaps the first match pattern
could be [ \t\x0b\f]*\r?\n



i.e. [space tab vertical-tab formfeed]



Regards,

Paul
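
One way to find out what the sequence really is would be to dump the code points of the raw field value. The helper below is a sketch with a hypothetical class name, not part of Solr; it prints every character that is not an ASCII letter or digit as U+XXXX.

```java
final class WhitespaceProbe {
    /** Returns the input with every non-alphanumeric character shown as [U+XXXX]. */
    static String describe(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            boolean asciiAlnum = (cp >= '0' && cp <= '9')
                    || (cp >= 'A' && cp <= 'Z')
                    || (cp >= 'a' && cp <= 'z');
            if (asciiAlnum) {
                out.appendCodePoint(cp);
            } else {
                out.append(String.format("[U+%04X]", cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample input only; replace it with the raw text taken from the EML file.
        System.out.println(describe("a \n\u00A0\nb"));
        // prints: a[U+0020][U+000A][U+00A0][U+000A]b
    }
}
```

Anything other than U+0020, U+0009, U+000B, U+000C, U+000D or U+000A between the line breaks would not be matched by the character class suggested above.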




Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-03-05 Thread Zheng Lin Edwin Yeo
Hi Paul,

I have modified the second pattern to (<br>){3,} instead of
(<br><br>){3,}. The pattern (<br><br>){3,} actually looks for 6 or more
<br> instead of 3, because the <br> appears twice inside the group. That
is why there were more <br> in the result: runs of fewer than 6 <br> were
not replaced, so we ended up with up to 5 <br> in the index.

Modified configuration:
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(&lt;br&gt;){3,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>

This brings us back to the result of the previous index content,
meaning the issue of having 4 <br> is still there.

Regards,
Edwin
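
The effect of the quantifier applying to the whole group can be checked with a small Java sketch. The class and method names are made up for illustration; the method counts how many <br> remain after one replacement pass.

```java
import java.util.regex.Pattern;

final class GroupQuantifierDemo {
    /** Builds "a" + brCount times "<br>" + "b", replaces matches with two <br>, counts the rest. */
    static int brAfter(String regex, int brCount) {
        StringBuilder in = new StringBuilder("a");
        for (int i = 0; i < brCount; i++) {
            in.append("<br>");
        }
        in.append("b");
        String out = Pattern.compile(regex).matcher(in).replaceAll("<br><br>");
        int count = 0;
        for (int i = out.indexOf("<br>"); i >= 0; i = out.indexOf("<br>", i + 4)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        for (int n = 2; n <= 7; n++) {
            System.out.println(n + " <br> in: "
                    + brAfter("(<br><br>){3,}", n) + " left with the doubled group, "
                    + brAfter("(<br>){3,}", n) + " left with the single group");
        }
    }
}
```

With the doubled group, 5 <br> stay 5 and 7 <br> become 3, because only complete pairs are consumed; with the single group every run of 3 or more ends up as 2.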




Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-03-05 Thread Zheng Lin Edwin Yeo
Hi Paul,

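Further to my previous email, in which there was an extra "}" in the
configuration, I have changed to the configuration below, based on your
suggestion.

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">[ \t]*\r?\n</str>
   <str name="replacement">&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>

However, the result that I get still has more than 2 <br>. In fact, the
result has become worse, as you can see from the comparison below.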

Example 1: A sentence for which the regex pattern used to work correctly.
With the latest pattern, it has changed from 2 <br> to 5 <br>,
which is wrong.
*Original content in EML file:*
Dear Sir,


I am terminating
*Original content:*Dear Sir,  \n\n \n \n\n I am terminating
*Previous Index content: *Dear Sir,  I am terminating
*Current Index content*:   Dear Sir,  I am terminating

Example 2: A sentence for which the above regex pattern only partially works
(as you can see, instead of 2 <br>, there are 4 <br>)
*Original content in EML file:*

*exalted*

*Psalm 89:17*


3 Choa Chu Kang Avenue 4
*Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
Chu Kang Avenue 4, Singapore
*Previous Index content: *exalted  Psalm 89:17   
3 Choa Chu Kang Avenue 4, Singapore
*Current Index content*:Psalm 89:173
Choa Chu Kang Avenue 3, Singapor4

Example 3: Another sentence for which the above regex pattern only partially
works (as you can see, instead of 2 <br>, there are 4 <br>). With the latest
code, there are now 5 <br>.
*Original content in EML file:*

http://www.concorded.com/








On Tue, Dec 18, 2018 at 10:07 AM
*Original content:* http://www.concorded.com/   \n\n   \n\n \n \n\n \n\n
\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at
10:07 AM
*Previous Index content: *http://www.concorded.com/ On
Tue, Dec 18, 2018 at 10:07 AM
*Current Index content:* http://www.concorded.com/  
On Tue, Dec 18, 2018 at 10:07 AM


Regards,
Edwin
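
Example 1 can be replayed outside Solr with plain java.util.regex. The sketch below assumes the raw text is exactly as printed in the "Original content" line of Example 1 (ASCII spaces and \n only); the class name is hypothetical.

```java
import java.util.regex.Pattern;

final class Example1Trace {
    /** Applies the two patterns from the configuration in this message, in order. */
    static String index(String raw) {
        // Step 1: blanks before a line ending, plus the line ending, become <br>.
        String step1 = Pattern.compile("[ \\t]*\\r?\\n").matcher(raw).replaceAll("<br>");
        // Step 2: three or more <br><br> pairs become one pair.
        return Pattern.compile("(<br><br>){3,}").matcher(step1).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        String raw = "Dear Sir,  \n\n \n \n\n I am terminating";
        System.out.println(index(raw));
        // prints: Dear Sir,<br><br><br><br><br> I am terminating
    }
}
```

The five \n each turn into one <br>, and five is one short of the six that the second pattern needs, which agrees with the 5 <br> reported for Example 1.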


Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-03-05 Thread Zheng Lin Edwin Yeo
Hi Paul,

Thank you for the reply.

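I have tried adding the following configuration according to your
suggestion:

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">[ \t]*\r?\n}</str>
   <str name="replacement">&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>

However, none of the \n is being removed this time round.
Is the order and/or the pattern correct?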

Regards,
Edwin
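
A likely explanation, sketched in Java: as far as I know, java.util.regex treats a dangling "}" as a literal character, so the first pattern above only matches a line ending that is immediately followed by "}". The class name and sample strings are illustrative.

```java
import java.util.regex.Pattern;

final class StrayBraceDemo {
    static String replace(String regex, String content) {
        return Pattern.compile(regex).matcher(content).replaceAll("<br>");
    }

    public static void main(String[] args) {
        String content = "line one  \nline two\n";
        // With the stray brace nothing matches, so the text is unchanged.
        System.out.println(replace("[ \\t]*\\r?\\n}", content).equals(content));
        // Without it, every line ending is replaced.
        System.out.println(replace("[ \\t]*\\r?\\n", content));
        // prints: line one<br>line two<br>
    }
}
```

That would explain why no \n was removed with this configuration, independent of the order of the two processors.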


AW: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-03-05 Thread paul.dodd
Hi Edwin



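Try for the first pattern/replacement

<str name="pattern">[ \t]*\r?\n</str>
<str name="replacement">&lt;br&gt;</str>

Now all line endings and preceding whitespace characters should be changed to
‘<br>’.

The second pattern replacement should replace 3 or more ‘<br>’ sequences to 2
‘<br>’ sequences:

<str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
<str name="replacement">&lt;br&gt;&lt;br&gt;</str>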



Hope this approach works. Sorry for not replying earlier and best regards,

Paul





Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-03-04 Thread Zheng Lin Edwin Yeo
Hi,

For your info, this issue is occurring in the new Solr 7.7.1 as well.

Regards,
Edwin

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-24 Thread Zheng Lin Edwin Yeo
Hi,

Does anyone else have other suggestions, or has anyone faced the same problem?

Regards,
Edwin


Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-20 Thread Zheng Lin Edwin Yeo
Hi Paul,

If I execute the second step first, then I only get a single <br> for
those cases that had 2 <br>.
For those cases where we originally got 4 <br>, there are now 2 <br> with a
space in between.

This just changes the 2 <br> into a single <br>, since the second step
replaces with a single <br>.
It has not solved the underlying problem yet.

Regards,
Edwin
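
The order dependence can be reproduced with the two patterns quoted below, ([ \t]*\r?\n){2,} and ([ \t]*\r?\n){1,}. This is a Java sketch with a hypothetical class name; the no-break space (U+00A0) in the second input is an assumption standing in for whatever character is not covered by [ \t].

```java
import java.util.regex.Pattern;

final class StepOrderDemo {
    static final Pattern MULTI = Pattern.compile("([ \\t]*\\r?\\n){2,}");
    static final Pattern SINGLE = Pattern.compile("([ \\t]*\\r?\\n){1,}");

    /** Configured order: runs of line endings first, then single ones. */
    static String multiFirst(String s) {
        String step1 = MULTI.matcher(s).replaceAll("<br><br>");
        return SINGLE.matcher(step1).replaceAll("<br>");
    }

    /** Swapped order: the single-line-ending step runs first. */
    static String singleFirst(String s) {
        String step1 = SINGLE.matcher(s).replaceAll("<br>");
        return MULTI.matcher(step1).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        String plain = "a\n\nb";
        String withNbsp = "a\n\n\u00A0\n\nb"; // hypothetical problem input
        System.out.println(multiFirst(plain));    // a<br><br>b
        System.out.println(singleFirst(plain));   // a<br>b
        System.out.println(multiFirst(withNbsp)); // 4 <br>, split 2 + 2
        System.out.println(singleFirst(withNbsp)); // 2 <br> with the character in between
    }
}
```

Both observations in this message (a single <br> where there were 2, and 2 <br> with a space in between where there were 4) fall out of this input, which suggests the separator is a character that the pattern does not treat as whitespace.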


On Wed, 20 Feb 2019 at 16:41,  wrote:

> If the second step is executed first, then you will get the unwanted 4 <br>
>
>
>
> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> Sent: Wednesday, 20 February 2019 09:29
> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>
>
>
> Hi Jörn ,
>
> Do you mean the regex is not correct?
>
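> We are already using two RegexReplaceProcessorFactory steps, like the ones
> shown below. The output that we get is still the same.
>
> <processor class="solr.RegexReplaceProcessorFactory">
>  <str name="fieldName">content</str>
>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>  <bool name="literalReplacement">true</bool>
> </processor>
>
> <processor class="solr.RegexReplaceProcessorFactory">
>  <str name="fieldName">content</str>
>  <str name="pattern">([ \t]*\r?\n){1,}</str>
>  <str name="replacement">&lt;br&gt;</str>
>  <bool name="literalReplacement">true</bool>
> </processor>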
>
> Regards,
> Edwin
>
> On Wed, 20 Feb 2019 at 16:03, Jörn Franke  wrote:
>
> > Then you need two RegexReplaceProcessorFactory steps
> >
> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > Thanks for the reply.
> > >
> > > Do you know of any online regex tool that works correctly for Java regex?
> > > I tried to find some, but they are not working properly.
> > >
> > > Yes, our plan is to replace more than one \n with <br><br>, and a single
> > > \n with a single <br>.
> > >
> > > Regards,
> > > Edwin
> > >
> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke wrote:
> > >>
> > >> Solr uses Java regex matching, so I doubt there is a bug; it would then
> > >> be in the JDK. Try out your solution in an online regex tool that
> > >> supports Java regex.
> > >>
> > >> I believe you want to have 2 regex processor factories:
> > >> one that deals with a single \n and one that deals with more than one \n
> > >>
> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> > >>>
> > >>> Hi,
> > >>>
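> > >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
> > >>> configuration:
> > >>>
> > >>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>  <str name="fieldName">content</str>
> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>  <bool name="literalReplacement">true</bool>
> > >>> </processor>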
> > >>>
> > >>> However, the issue is still occurring.
> > >>>
> > >>> Anyone else is able to help?
> > >>>
> > >>> Regards,
> > >>> Edwin
> > >>>
> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <
> > edwinye...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
> > >>>>
> > >>>> Regards,
> > >>>> Edwin
> > >>>>
> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <
> > edwinye...@gmail.com
> > >>>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> Should we report this as a bug in Solr?
> > >>>>>
> > >>>>> Regards,
> > >>>>> Edwin
> > >>>>>
> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <
> > edwinye...@gmail.com
> > >>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hi Paul,
> > >>>>>>
> > > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
> > > >>>>>> https://regex101.com/, it is able to give us the correct result for all
> > > >>>>>> the examples (i.e. all of them will only have <br><br>, and not more than
> > > >>>>>> that like what we are getting in Solr in our earlier exam

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-20 Thread paul.dodd
If the second step is executed first, then you will get the unwanted 4 <br>.






From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
Sent: Wednesday, 20 February 2019 09:29
To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n



Hi Jörn,

Do you mean the regex is not correct?

We are already using two RegexReplaceProcessorFactory steps, like the ones
shown below. The output that we get is still the same.


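<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">([ \t]*\r?\n){2,}</str>
  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
  <bool name="literalReplacement">true</bool>
</processor>

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">([ \t]*\r?\n){1,}</str>
  <str name="replacement">&lt;br&gt;</str>
  <bool name="literalReplacement">true</bool>
</processor>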


Regards,
Edwin
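
A minimal Java sketch of the two steps above, for anyone who wants to check the patterns outside Solr. It assumes the steps run in the order listed (the {2,} pattern first, then the {1,} pattern) and that the replacement is taken literally. The class and method names are made up for illustration; this is not Solr's own code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoStepNewlineCollapse {
    // Step 1: two or more line breaks, each optionally preceded by spaces or tabs.
    private static final Pattern MULTI = Pattern.compile("([ \\t]*\\r?\\n){2,}");
    // Step 2: whatever single line breaks are left after step 1.
    private static final Pattern SINGLE = Pattern.compile("([ \\t]*\\r?\\n){1,}");

    static String collapse(String content) {
        // quoteReplacement makes the replacement literal, similar in spirit to literalReplacement=true.
        String afterStep1 = MULTI.matcher(content).replaceAll(Matcher.quoteReplacement("<br><br>"));
        return SINGLE.matcher(afterStep1).replaceAll(Matcher.quoteReplacement("<br>"));
    }

    public static void main(String[] args) {
        System.out.println(collapse("Dear Sir,\n\n\nI am terminating\nthe contract"));
    }
}
```

When only spaces, tabs and line breaks sit between the paragraphs, a run of line breaks becomes <br><br> and a lone line break becomes <br>.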

On Wed, 20 Feb 2019 at 16:03, Jörn Franke  wrote:

> Then you need two RegexReplaceProcessorFactory steps
>
> > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo wrote:
> >
> > Hi,
> >
> > Thanks for the reply.
> >
> > Do you know of any regex online tool that works correctly for Java regex?
> > I tried to find some, but they are not working properly.
> >
> > Yes, our plan is to replace more than one \n with <br><br>, and a single \n
> > with a single <br>.
> >
> > Regards,
> > Edwin
> >
> >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke  wrote:
> >>
> >> Solr uses Java regex matching, so i doubt there is a bug - it would then
> >> be in the JDK. Try out in a regex online Tool that supports Java regex
> for
> >> your solution.
> >>
> >> I believe you want to have 2 regex process factories:
> >> One that deals with single \n and one that deals with more than one \n
> >>
> >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
> >>> configuration:
> >>>
> >>> 
> >>>  content
> >>>  ([ \t]*\r?\n){2,}
> >>>  brbr
> >>>  true
> >>> 
> >>>
> >>> However, the issue is still occurring.
> >>>
> >>> Anyone else is able to help?
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
> >>>>
> >>>> Regards,
> >>>> Edwin
> >>>>
> >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com
> >>>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> Should we report this as a bug in Solr?
> >>>>>
> >>>>> Regards,
> >>>>> Edwin
> >>>>>
> >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com
> >>>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Paul,
> >>>>>>
> >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
> >>>>>> https://regex101.com/, it is able to give us the correct result for all
> >>>>>> the examples (i.e. all of them will only have <br><br>, and not more than
> >>>>>> that like what we are getting in Solr in our earlier examples).
> >>>>>>
> >>>>>> Could there be a possibility of a bug in Solr?
> >>>>>>
> >>>>>> Regards,
> >>>>>> Edwin
> >>>>>>
> >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <
> >> edwinye...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Paul,
> >>>>>>>
> >>>>>>> We have tried it with the space preceding the \n, i.e. <str name="pattern">(\s*\n){2,}</str>, with the following configuration:
> >>>>>>>
> >>>>>>> 
> >>>>>>>  content
> >>>>>>>  (\s*\n){2,}
> >>>>>>>  brbr
> >>>>>>> 
> >>>>>>>
> >>>>>>> Howev

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-20 Thread Zheng Lin Edwin Yeo
can refer to the
> >>>>>>> original content in the same examples below.
> >>>>>>>
> >>>>>>>
> >>>>>>> Example 1: The sentence that the above regex pattern is working
> >>>>>>> correctly
> >>>>>>> *Original content in EML file:*
> >>>>>>> Dear Sir,
> >>>>>>>
> >>>>>>>
> >>>>>>> I am terminating
> >>>>>>> *Original content:*Dear Sir,  \n\n \n \n\n I am terminating
> >>>>>>> *Index content: *Dear Sir,  I am terminating
> >>>>>>>
> >>>>>>> Example 2: The sentence that the above regex pattern is partially
> >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>> *Original content in EML file:*
> >>>>>>>
> >>>>>>> *exalted*
> >>>>>>>
> >>>>>>> *Psalm 89:17*
> >>>>>>>
> >>>>>>>
> >>>>>>> 3 Choa Chu Kang Avenue 4
> >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> >>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>>> *Index content: *exalted  Psalm 89:17 3
> >>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>>>
> >>>>>>> Example 3: The sentence that the above regex pattern is partially
> >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>> *Original content in EML file:*
> >>>>>>>
> >>>>>>> http://www.concordpri.moe.edu.sg/
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n
>  \n\n
> >> \n
> >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
> >> Dec 18,
> >>>>>>> 2018 at 10:07 AM
> >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   
> >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>>
> >>>>>>>
> >>>>>>> Appreciate any other ideas or suggestions that you may have.
> >>>>>>>
> >>>>>>> Thank you.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Edwin
> >>>>>>>
> >>>>>>>> On Thu, 7 Feb 2019 at 22:49,  wrote:
> >>>>>>>>
> >>>>>>>> Hi Edwin
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 1.  Sorry, the pattern was wrong, the space should precede the \n
> >>>>>>>> i.e. (\s*\n){2,}
> >>>>>>>> 2.  Perhaps in the data you have other (non printing) characters
> >>>>>>>> than \n?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> >>>>>>>> Sent: Thursday, 7 February 2019 15:23
> >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Paul,
> >>>>>>>>
> >>>>>>>> We have tried this suggested regex pattern as follow:
> >>>>>>>> 
> >>>>>>>>  content
> >>>>>>>>  (\n\s*){2,}
> >>>>>>>>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-20 Thread Zheng Lin Edwin Yeo
Hi Paul,

I am using Java 1.8.0_201.

Regards,
Edwin

On Wed, 20 Feb 2019 at 16:01,  wrote:

> BTW, which Java Version are you using?
>
>
>
>
>
>
> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> Sent: Wednesday, 20 February 2019 08:13
> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>
>
>
> Hi,
>
> Thanks for the reply.
>
> Do you know of any regex online tool that works correctly for Java regex?
> I tried to find some, but they are not working properly.
>
> Yes, our plan is to replace more than one \n with <br><br>, and a single \n
> with a single <br>.
>
> Regards,
> Edwin
>
> On Wed, 20 Feb 2019 at 14:59, Jörn Franke  wrote:
>
> > Solr uses Java regex matching, so i doubt there is a bug - it would then
> > be in the JDK. Try out in a regex online Tool that supports Java regex
> for
> > your solution.
> >
> > I believe you want to have 2 regex process factories:
> > One that deals with single \n and one that deals with more than one \n
> >
> > > On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > We have tried with the following pattern ([ \t]*\r?\n){2,} and
> > > configuration:
> > >
> > > 
> > >   content
> > >   ([ \t]*\r?\n){2,}
> > >   brbr
> > >   true
> > > 
> > >
> > > However, the issue is still occurring.
> > >
> > > Anyone else is able to help?
> > >
> > > Regards,
> > > Edwin
> > >
> > > On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com>
> > > wrote:
> > >
> > >> Hi,
> > >>
> > >> For your info, this issue is occurring in Solr 7.7.0 as well.
> > >>
> > >> Regards,
> > >> Edwin
> > >>
> > >> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com
> > >
> > >> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> Should we report this as a bug in Solr?
> > >>>
> > >>> Regards,
> > >>> Edwin
> > >>>
> > >>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com
> > >
> > >>> wrote:
> > >>>
> > >>>> Hi Paul,
> > >>>>
> > > >>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
> > > >>>> https://regex101.com/, it is able to give us the correct result for all
> > > >>>> the examples (i.e. all of them will only have <br><br>, and not more than
> > > >>>> that like what we are getting in Solr in our earlier examples).
> > >>>>
> > >>>> Could there be a possibility of a bug in Solr?
> > >>>>
> > >>>> Regards,
> > >>>> Edwin
> > >>>>
> > >>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <
> > edwinye...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Paul,
> > >>>>>
> > > >>>>> We have tried it with the space preceding the \n, i.e. <str name="pattern">(\s*\n){2,}</str>, with the following configuration:
> > >>>>>
> > >>>>> 
> > >>>>>   content
> > >>>>>   (\s*\n){2,}
> > >>>>>   brbr
> > >>>>> 
> > >>>>>
> > >>>>> However, we are also getting the exact same results as the earlier
> > >>>>> Example 1, 2 and 3.
> > >>>>>
> > >>>>> As for your point 2 on perhaps in the data you have other (non
> > > >>>>> printing) characters than \n, we have found that there are no non
> > printing
> > >>>>> characters. It is just next line with a space. You can refer to the
> > >>>>> original content in the same examples below.
> > >>>>>
> > >>>>>
> > >>>>> Example 1: The sentence that the above regex pattern is working
> > >>>>> correctly
> > >>>>> *Original

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-20 Thread Jörn Franke
>>>>>>> *Psalm 89:17*
>>>>>>> 
>>>>>>> 
>>>>>>> 3 Choa Chu Kang Avenue 4
>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>>>>> Choa Chu Kang Avenue 4, Singapore
>>>>>>> *Index content: *exalted  Psalm 89:17 3
>>>>>>> Choa Chu Kang Avenue 4, Singapore
>>>>>>> 
>>>>>>> Example 3: The sentence that the above regex pattern is partially
>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>> *Original content in EML file:*
>>>>>>> 
>>>>>>> http://www.concordpri.moe.edu.sg/
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n
>> \n
>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
>> Dec 18,
>>>>>>> 2018 at 10:07 AM
>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   
>>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>>>>>> 
>>>>>>> 
>>>>>>> Appreciate any other ideas or suggestions that you may have.
>>>>>>> 
>>>>>>> Thank you.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Edwin
>>>>>>> 
>>>>>>>> On Thu, 7 Feb 2019 at 22:49,  wrote:
>>>>>>>> 
>>>>>>>> Hi Edwin
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 1.  Sorry, the pattern was wrong, the space should precede the \n
>>>>>>>> i.e. (\s*\n){2,}
>>>>>>>> 2.  Perhaps in the data you have other (non printing) characters
>>>>>>>> than \n?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>>>>>>>> Sent: Thursday, 7 February 2019 15:23
>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Paul,
>>>>>>>> 
>>>>>>>> We have tried this suggested regex pattern as follow:
>>>>>>>> 
>>>>>>>>  content
>>>>>>>>  (\n\s*){2,}
>>>>>>>>  brbr
>>>>>>>> 
>>>>>>>> 
>>>>>>>> But we still have exactly the same problem of Example 1,2 and 3
>> below.
>>>>>>>> 
>>>>>>>> Example 1: The sentence that the above regex pattern is working
>>>>>>>> correctly
>>>>>>>> *Original content:*Dear Sir,  \n\n \n \n\n I am terminating
>>>>>>>> *Index content: *Dear Sir,  I am terminating
>>>>>>>> 
>>>>>>>> Example 2: The sentence that the above regex pattern is partially
>>>>>>>> working
>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>>>>>> Choa
>>>>>>>> Chu Kang Avenue 4, Singapore
>>>>>>>> *Index content: *exalted  Psalm 89:17 3
>>>>>>>> Choa
>>>>>>>> Chu Kang Avenue 4, Singapore
>>>>>>>> 
>>>>>>>> Example 3: The sentence that the above regex pattern is partially
>>>>>>>> working
>>>>>>>> (as you can see, instead of 2 <br>, there are


Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-20 Thread paul.dodd
BTW, which Java version are you using?






From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
Sent: Wednesday, 20 February 2019 08:13
To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n



Hi,

Thanks for the reply.

Do you know of any regex online tool that works correctly for Java regex?
I tried to find some, but they are not working properly.

Yes, our plan is to replace more than one \n with <br><br>, and a single \n
with a single <br>.

Regards,
Edwin

On Wed, 20 Feb 2019 at 14:59, Jörn Franke  wrote:

> Solr uses Java regex matching, so i doubt there is a bug - it would then
> be in the JDK. Try out in a regex online Tool that supports Java regex for
> your solution.
>
> I believe you want to have 2 regex process factories:
> One that deals with single \n and one that deals with more than one \n
>
> > On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo wrote:
> >
> > Hi,
> >
> > We have tried with the following pattern ([ \t]*\r?\n){2,} and
> > configuration:
> >
> > 
> >   content
> >   ([ \t]*\r?\n){2,}
> >   brbr
> >   true
> > 
> >
> > However, the issue is still occurring.
> >
> > Anyone else is able to help?
> >
> > Regards,
> > Edwin
> >
> > On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo 
> > wrote:
> >
> >> Hi,
> >>
> >> For your info, this issue is occurring in Solr 7.7.0 as well.
> >>
> >> Regards,
> >> Edwin
> >>
> >> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo  >
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> Should we report this as a bug in Solr?
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo  >
> >>> wrote:
> >>>
> >>>> Hi Paul,
> >>>>
> > >>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
> > >>>> https://regex101.com/, it is able to give us the correct result for all
> > >>>> the examples (i.e. all of them will only have <br><br>, and not more than
> > >>>> that like what we are getting in Solr in our earlier examples).
> >>>>
> >>>> Could there be a possibility of a bug in Solr?
> >>>>
> >>>> Regards,
> >>>> Edwin
> >>>>
> >>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Paul,
> >>>>>
> > >>>>> We have tried it with the space preceding the \n, i.e. <str name="pattern">(\s*\n){2,}</str>, with the following configuration:
> >>>>>
> >>>>> 
> >>>>>   content
> >>>>>   (\s*\n){2,}
> >>>>>   brbr
> >>>>> 
> >>>>>
> >>>>> However, we are also getting the exact same results as the earlier
> >>>>> Example 1, 2 and 3.
> >>>>>
> >>>>> As for your point 2 on perhaps in the data you have other (non
> > >>>>> printing) characters than \n, we have found that there are no non
> printing
> >>>>> characters. It is just next line with a space. You can refer to the
> >>>>> original content in the same examples below.
> >>>>>
> >>>>>
> >>>>> Example 1: The sentence that the above regex pattern is working
> >>>>> correctly
> >>>>> *Original content in EML file:*
> >>>>> Dear Sir,
> >>>>>
> >>>>>
> >>>>> I am terminating
> >>>>> *Original content:*Dear Sir,  \n\n \n \n\n I am terminating
> >>>>> *Index content: *Dear Sir,  I am terminating
> >>>>>
> >>>>> Example 2: The sentence that the above regex pattern is partially
> > >>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>> *Original content in EML file:*
> >>>>>
> >>>>> *exalted*
> >>>>>
> >>>>> *Psalm 89:17*
> >>>>>
> >>>>>
> >>>>> 3 Choa Chu Kang Avenue 4
> >>>>> *Original content:* exalte

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-19 Thread Zheng Lin Edwin Yeo
Hi,

Thanks for the reply.

Do you know of any regex online tool that works correctly for Java regex?
I tried to find some, but they are not working properly.

Yes, our plan is to replace more than one \n with <br><br>, and a single \n
with a single <br>.

Regards,
Edwin

On Wed, 20 Feb 2019 at 14:59, Jörn Franke  wrote:

> Solr uses Java regex matching, so i doubt there is a bug - it would then
> be in the JDK. Try out in a regex online Tool that supports Java regex for
> your solution.
>
> I believe you want to have 2 regex process factories:
> One that deals with single \n and one that deals with more than one \n
>
> > On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo wrote:
> >
> > Hi,
> >
> > We have tried with the following pattern ([ \t]*\r?\n){2,} and
> > configuration:
> >
> > 
> >   content
> >   ([ \t]*\r?\n){2,}
> >   brbr
> >   true
> > 
> >
> > However, the issue is still occurring.
> >
> > Anyone else is able to help?
> >
> > Regards,
> > Edwin
> >
> > On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo 
> > wrote:
> >
> >> Hi,
> >>
> >> For your info, this issue is occurring in Solr 7.7.0 as well.
> >>
> >> Regards,
> >> Edwin
> >>
> >> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo  >
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> Should we report this as a bug in Solr?
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo  >
> >>> wrote:
> >>>
> >>>> Hi Paul,
> >>>>
> > >>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
> > >>>> https://regex101.com/, it is able to give us the correct result for all
> > >>>> the examples (i.e. all of them will only have <br><br>, and not more than
> > >>>> that like what we are getting in Solr in our earlier examples).
> >>>>
> >>>> Could there be a possibility of a bug in Solr?
> >>>>
> >>>> Regards,
> >>>> Edwin
> >>>>
> >>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Paul,
> >>>>>
> > >>>>> We have tried it with the space preceding the \n, i.e. <str name="pattern">(\s*\n){2,}</str>, with the following configuration:
> >>>>>
> >>>>> 
> >>>>>   content
> >>>>>   (\s*\n){2,}
> >>>>>   brbr
> >>>>> 
> >>>>>
> >>>>> However, we are also getting the exact same results as the earlier
> >>>>> Example 1, 2 and 3.
> >>>>>
> >>>>> As for your point 2 on perhaps in the data you have other (non
> > >>>>> printing) characters than \n, we have found that there are no non
> printing
> >>>>> characters. It is just next line with a space. You can refer to the
> >>>>> original content in the same examples below.
> >>>>>
> >>>>>
> >>>>> Example 1: The sentence that the above regex pattern is working
> >>>>> correctly
> >>>>> *Original content in EML file:*
> >>>>> Dear Sir,
> >>>>>
> >>>>>
> >>>>> I am terminating
> >>>>> *Original content:*Dear Sir,  \n\n \n \n\n I am terminating
> >>>>> *Index content: *Dear Sir,  I am terminating
> >>>>>
> >>>>> Example 2: The sentence that the above regex pattern is partially
> > >>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>> *Original content in EML file:*
> >>>>>
> >>>>> *exalted*
> >>>>>
> >>>>> *Psalm 89:17*
> >>>>>
> >>>>>
> >>>>> 3 Choa Chu Kang Avenue 4
> >>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> >>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>> *Index content: *exalted  Psalm 89:17 3
> >>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>
> >>>>> Example 3: The sentence that the above regex pattern is partially
> >>>>> working (as you 

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-19 Thread Jörn Franke
Solr uses Java regex matching, so I doubt there is a bug in Solr itself; if there were one, it would be in
the JDK. Try your pattern in an online regex tool that supports the Java regex flavor.

I believe you want to have two RegexReplaceProcessorFactory steps:
one that deals with a single \n and one that deals with more than one \n.
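
Since online testers often implement other regex flavors, one option is to run the pattern directly against java.util.regex. The sketch below is a hypothetical helper (RegexProbe and its methods are not part of Solr or the JDK) that lists every match with its offsets and escapes invisible characters, so it is easy to see where a run of line breaks gets split into several matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexProbe {
    // Returns one entry per match in the form "[start,end) escaped-text".
    static List<String> describeMatches(String regex, String input) {
        List<String> lines = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            lines.add("[" + m.start() + "," + m.end() + ") " + escape(m.group()));
        }
        return lines;
    }

    // Makes invisible characters visible: line breaks, tabs, and anything outside printable ASCII.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '\n') sb.append("\\n");
            else if (c == '\r') sb.append("\\r");
            else if (c == '\t') sb.append("\\t");
            else if (c < 0x20 || c > 0x7e) sb.append(String.format("\\u%04x", (int) c));
            else sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "Dear Sir,  \n\n \n \n\n I am terminating";
        describeMatches("(\\n\\s*){2,}", sample).forEach(System.out::println);
    }
}
```

If a single visual gap in the text shows up as two or more matches, some character between the line breaks is not covered by the pattern.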

> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo wrote:
> 
> Hi,
> 
> We have tried with the following pattern ([ \t]*\r?\n){2,} and
> configuration:
> 
> 
>   content
>   ([ \t]*\r?\n){2,}
>   brbr
>   true
> 
> 
> However, the issue is still occurring.
> 
> Anyone else is able to help?
> 
> Regards,
> Edwin
> 
> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo 
> wrote:
> 
>> Hi,
>> 
>> For your info, this issue is occurring in Solr 7.7.0 as well.
>> 
>> Regards,
>> Edwin
>> 
>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo 
>> wrote:
>> 
>>> Hi,
>>> 
>>> Should we report this as a bug in Solr?
>>> 
>>> Regards,
>>> Edwin
>>> 
>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo 
>>> wrote:
>>> 
>>>> Hi Paul,
>>>> 
>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
>>>> https://regex101.com/, it is able to give us the correct result for all
>>>> the examples (i.e. all of them will only have <br><br>, and not more than
>>>> that like what we are getting in Solr in our earlier examples).
>>>> 
>>>> Could there be a possibility of a bug in Solr?
>>>> 
>>>> Regards,
>>>> Edwin
>>>> 
>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo 
>>>> wrote:
>>>> 
>>>>> Hi Paul,
>>>>> 
>>>>> We have tried it with the space preceding the \n, i.e. <str name="pattern">(\s*\n){2,}</str>, with the following configuration:
>>>>> 
>>>>> 
>>>>>   content
>>>>>   (\s*\n){2,}
>>>>>   brbr
>>>>> 
>>>>> 
>>>>> However, we are also getting the exact same results as the earlier
>>>>> Example 1, 2 and 3.
>>>>> 
>>>>> As for your point 2 on perhaps in the data you have other (non
>>>>> printing) characters than \n, we have found that there are no non-printing
>>>>> characters. It is just next line with a space. You can refer to the
>>>>> original content in the same examples below.
>>>>> 
>>>>> 
>>>>> Example 1: The sentence that the above regex pattern is working
>>>>> correctly
>>>>> *Original content in EML file:*
>>>>> Dear Sir,
>>>>> 
>>>>> 
>>>>> I am terminating
>>>>> *Original content:*Dear Sir,  \n\n \n \n\n I am terminating
>>>>> *Index content: *Dear Sir,  I am terminating
>>>>> 
>>>>> Example 2: The sentence that the above regex pattern is partially
>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> *Original content in EML file:*
>>>>> 
>>>>> *exalted*
>>>>> 
>>>>> *Psalm 89:17*
>>>>> 
>>>>> 
>>>>> 3 Choa Chu Kang Avenue 4
>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>>> Choa Chu Kang Avenue 4, Singapore
>>>>> *Index content: *exalted  Psalm 89:17 3
>>>>> Choa Chu Kang Avenue 4, Singapore
>>>>> 
>>>>> Example 3: The sentence that the above regex pattern is partially
>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> *Original content in EML file:*
>>>>> 
>>>>> http://www.concordpri.moe.edu.sg/
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 
>>>>> 18,
>>>>> 2018 at 10:07 AM
>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   
>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>>>> 
>>>>> 
>>>>> Appreciate any other ideas or suggestions that you may hav

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-19 Thread Zheng Lin Edwin Yeo
Hi,

We have tried with the following pattern ([ \t]*\r?\n){2,} and
configuration:


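<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">([ \t]*\r?\n){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>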


However, the issue is still occurring.

Is anyone else able to help?

Regards,
Edwin
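
One way the doubled output can arise, sketched in Java under an assumption the thread has not confirmed: that a character such as a non-breaking space (U+00A0) sits between the line breaks. [ \t] and Java's default \s do not match U+00A0, so the run is split into two matches and each one is replaced separately. The class name and sample text are made up for illustration.

```java
import java.util.regex.Pattern;

final class NbspSplitDemo {
    // Replaces every match of the given regex with a literal <br><br>.
    static String collapse(String regex, String content) {
        return Pattern.compile(regex).matcher(content).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Hypothetical input: two blank-line runs separated by a non-breaking space (U+00A0).
        String content = "Psalm 89:17\n\n\u00a0\n\n3 Choa Chu Kang Avenue 4";
        // Space and tab only: the non-breaking space splits the run into two matches.
        System.out.println(collapse("([ \\t]*\\r?\\n){2,}", content));
        // \h (horizontal whitespace, Java 8+) also covers U+00A0, so the run matches once.
        System.out.println(collapse("(\\h*\\r?\\n){2,}", content));
    }
}
```

The first call leaves the non-breaking space in place with <br><br> on both sides, which is four <br> in total; the second matches the whole run once. Whether this applies here depends on what is actually in the indexed content, so it is worth dumping the raw characters first.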

On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> For your info, this issue is occurring in Solr 7.7.0 as well.
>
> Regards,
> Edwin
>
> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo 
> wrote:
>
>> Hi,
>>
>> Should we report this as a bug in Solr?
>>
>> Regards,
>> Edwin
>>
>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo 
>> wrote:
>>
>>> Hi Paul,
>>>
>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
>>> https://regex101.com/, it is able to give us the correct result for all
>>> the examples (i.e. all of them will only have <br><br>, and not more than
>>> that like what we are getting in Solr in our earlier examples).
>>>
>>> Could there be a possibility of a bug in Solr?
>>>
>>> Regards,
>>> Edwin
>>>
>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo 
>>> wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> We have tried it with the space preceding the \n, i.e. <str name="pattern">(\s*\n){2,}</str>, with the following configuration:
>>>>
>>>> 
>>>>content
>>>>(\s*\n){2,}
>>>>brbr
>>>> 
>>>>
>>>> However, we are also getting the exact same results as the earlier
>>>> Example 1, 2 and 3.
>>>>
>>>> As for your point 2 on perhaps in the data you have other (non
>>>> printing) characters than \n, we have found that there are no non-printing
>>>> characters. It is just next line with a space. You can refer to the
>>>> original content in the same examples below.
>>>>
>>>>
>>>> Example 1: The sentence that the above regex pattern is working
>>>> correctly
>>>> *Original content in EML file:*
>>>> Dear Sir,
>>>>
>>>>
>>>> I am terminating
>>>> *Original content:*Dear Sir,  \n\n \n \n\n I am terminating
>>>> *Index content: *Dear Sir,  I am terminating
>>>>
>>>> Example 2: The sentence that the above regex pattern is partially
>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> *Original content in EML file:*
>>>>
>>>> *exalted*
>>>>
>>>> *Psalm 89:17*
>>>>
>>>>
>>>> 3 Choa Chu Kang Avenue 4
>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>> Choa Chu Kang Avenue 4, Singapore
>>>> *Index content: *exalted  Psalm 89:17 3
>>>> Choa Chu Kang Avenue 4, Singapore
>>>>
>>>> Example 3: The sentence that the above regex pattern is partially
>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> *Original content in EML file:*
>>>>
>>>> http://www.concordpri.moe.edu.sg/
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>>> 2018 at 10:07 AM
>>>> *Index content: *http://www.concordpri.moe.edu.sg/   
>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>>>
>>>>
>>>> Appreciate any other ideas or suggestions that you may have.
>>>>
>>>> Thank you.
>>>>
>>>> Regards,
>>>> Edwin
>>>>
>>>> On Thu, 7 Feb 2019 at 22:49,  wrote:
>>>>
>>>>> Hi Edwin
>>>>>
>>>>>
>>>>>
>>>>>   1.  Sorry, the pattern was wrong, the space should precede the \n
>>>>> i.e. (\s*\n){2,}
>>>>>   2.  Perhaps in the data you have other (non printing) characters
>>>>> than \n?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>>>>> Sent: Thursday, 7 Feb

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-14 Thread Zheng Lin Edwin Yeo
Hi,

For your info, this issue is occurring in Solr 7.7.0 as well.

Regards,
Edwin

On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> Should we report this as a bug in Solr?
>
> Regards,
> Edwin
>
> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo 
> wrote:
>
>> Hi Paul,
>>
>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
>> https://regex101.com/, it is able to give us the correct result for all
>> the examples (i.e. all of them will only have <br><br>, and not more than
>> that like what we are getting in Solr in our earlier examples).
>>
>> Could there be a possibility of a bug in Solr?
>>
>> Regards,
>> Edwin
>>
>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo 
>> wrote:
>>
>>> Hi Paul,
>>>
>>> We have tried it with the space preceding the \n, i.e. <str name="pattern">(\s*\n){2,}</str>, with the following configuration:
>>>
>>> 
>>>content
>>>(\s*\n){2,}
>>>brbr
>>> 
>>>
>>> However, we are also getting the exact same results as the earlier
>>> Example 1, 2 and 3.
>>>
>>> As for your point 2 on perhaps in the data you have other (non printing)
>>> characters than \n, we have found that there are no non-printing characters.
>>> It is just next line with a space. You can refer to the original content in
>>> the same examples below.
>>>
>>>
>>> Example 1: The sentence that the above regex pattern is working
>>> correctly
>>> *Original content in EML file:*
>>> Dear Sir,
>>>
>>>
>>> I am terminating
>>> *Original content:*Dear Sir,  \n\n \n \n\n I am terminating
>>> *Index content: *Dear Sir,  I am terminating
>>>
>>> Example 2: The sentence that the above regex pattern is partially
>>> working (as you can see, instead of 2 , there are 4 )
>>> *Original content in EML file:*
>>>
>>> *exalted*
>>>
>>> *Psalm 89:17*
>>>
>>>
>>> 3 Choa Chu Kang Avenue 4
>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>> Choa Chu Kang Avenue 4, Singapore
>>> *Index content: *exalted  Psalm 89:17 3
>>> Choa Chu Kang Avenue 4, Singapore
>>>
>>> Example 3: The sentence that the above regex pattern is partially
>>> working (as you can see, instead of 2 , there are 4 )
>>> *Original content in EML file:*
>>>
>>> http://www.concordpri.moe.edu.sg/
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Dec 18, 2018 at 10:07 AM
>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>> 2018 at 10:07 AM
>>> *Index content: *http://www.concordpri.moe.edu.sg/   
>>> On Tue, Dec 18, 2018 at 10:07 AM
>>>
>>>
>>> Appreciate any other ideas or suggestions that you may have.
>>>
>>> Thank you.
>>>
>>> Regards,
>>> Edwin
>>>
>>> On Thu, 7 Feb 2019 at 22:49,  wrote:
>>>
>>>> Hi Edwin
>>>>
>>>>
>>>>
>>>>   1.  Sorry, the pattern was wrong, the space should preceed the \n
>>>> i.e. (\s*\n){2,}
>>>>   2.  Perhaps in the data you have other (non printing) characters than
>>>> \n?
>>>>
>>>>
>>>>
>>>> Gesendet von Mail<https://go.microsoft.com/fwlink/?LinkId=550986> für
>>>> Windows 10
>>>>
>>>>
>>>>
>>>> Von: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>>>> Gesendet: Donnerstag, 7. Februar 2019 15:23
>>>> An: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>> Betreff: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>
>>>>
>>>>
>>>> Hi Paul,
>>>>
>>>> We have tried this suggested regex pattern as follow:
>>>> 
>>>>content
>>>>(\n\s*){2,}
>>>>brbr
>>>> 
>>>>
>>>> But we still have exactly the same problem of Example 1,2 and 3 below.
>>>>
>>>> Example 1: The sentence that the above regex pattern is working
>>>> correctly
>>>> *Original content:*Dear Sir,  \n\n \n \n\n I am terminating
>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-11 Thread Zheng Lin Edwin Yeo
Hi,

Should we report this as a bug in Solr?

Regards,
Edwin


Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-08 Thread Zheng Lin Edwin Yeo
Hi Paul,

Regarding the regex (\n\s*){2,} that we are using: when we try it on
https://regex101.com/, it gives the correct result for all the
examples (i.e. each of them ends up with only <br><br>, and not more than that,
unlike what we are getting from Solr in our earlier examples).

Could there be a possibility of a bug in Solr?
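
One possible explanation for a difference between a regex tester and Solr, offered only as an assumption and not confirmed here, is that the data contains a whitespace-like character that \s does not cover. The sketch below uses Python's re module with re.ASCII, which is assumed to approximate the default \s of Java's java.util.regex; the sample text is made up, with a no-break space (U+00A0) sitting between two newline pairs.

```python
import re

# re.ASCII restricts \s to [ \t\n\r\f\v]; this is assumed to approximate
# Java's default \s (that is, without UNICODE_CHARACTER_CLASS).
ascii_ws = re.compile(r"(\n\s*){2,}", re.ASCII)

# Same pattern, with NO-BREAK SPACE added explicitly to the whitespace class.
with_nbsp = re.compile(r"(\n[\s\u00a0]*){2,}", re.ASCII)

# Hypothetical data: two newline pairs separated only by a no-break space.
text = "exalted\n\n\u00a0\n\nPsalm 89:17"

split_result = ascii_ws.sub("<br><br>", text)
joined_result = with_nbsp.sub("<br><br>", text)

print(split_result.count("<br>"))   # the run is split in two by U+00A0
print(joined_result.count("<br>"))  # the run is matched as a whole
```

If the real data behaves like this sample, the run of line breaks is replaced twice rather than once, which would look like four <br> where two were expected.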

Regards,
Edwin


Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-07 Thread Zheng Lin Edwin Yeo
Hi Paul,

We have tried it with the whitespace preceding the \n, i.e. (\s*\n){2,}, using the following configuration:

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\s*\n){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
</processor>

However, we get exactly the same results as in the earlier Examples
1, 2 and 3.

As for your point 2, that the data may contain other (non-printing)
characters besides \n: we have found no non-printing characters.
It is just a line break followed by a space. You can refer to the original
content in the examples below.


Example 1: A sentence on which the above regex pattern works correctly
*Original content in EML file:*
Dear Sir,


I am terminating
*Original content:*Dear Sir,  \n\n \n \n\n I am terminating
*Index content: *Dear Sir,  I am terminating

Example 2: A sentence on which the above regex pattern only partially works
(as you can see, instead of 2 <br>, there are 4 <br>)
*Original content in EML file:*

*exalted*

*Psalm 89:17*


3 Choa Chu Kang Avenue 4
*Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
Chu Kang Avenue 4, Singapore
*Index content: *exalted  Psalm 89:17 3 Choa
Chu Kang Avenue 4, Singapore

Example 3: A sentence on which the above regex pattern only partially works
(as you can see, instead of 2 <br>, there are 4 <br>)
*Original content in EML file:*

http://www.concordpri.moe.edu.sg/








On Tue, Dec 18, 2018 at 10:07 AM
*Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n
\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
at 10:07 AM
*Index content: *http://www.concordpri.moe.edu.sg/ On
Tue, Dec 18, 2018 at 10:07 AM


Appreciate any other ideas or suggestions that you may have.

Thank you.

Regards,
Edwin


Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-07 Thread paul.dodd
Hi Edwin



  1.  Sorry, the pattern was wrong; the whitespace should precede the \n, i.e. (\s*\n){2,}
  2.  Perhaps the data contains other (non-printing) characters besides \n?
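
To check point 2, one option is to list every whitespace-like or invisible character in the raw field value that is not a plain space, tab, CR, or LF. The following is a minimal Python sketch; the helper name find_unusual_whitespace and the sample string are hypothetical and only serve as illustration.

```python
import unicodedata

def find_unusual_whitespace(text):
    """Return (index, code point, Unicode name) for each whitespace-like or
    invisible character that is not a plain space, tab, CR, or LF."""
    plain = {" ", "\t", "\r", "\n"}
    hits = []
    for i, ch in enumerate(text):
        if ch in plain:
            continue
        category = unicodedata.category(ch)
        # Zs = space separators (e.g. NO-BREAK SPACE); Cc/Cf = control/format chars.
        if ch.isspace() or category in ("Zs", "Cc", "Cf"):
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((i, f"U+{ord(ch):04X}", name))
    return hits

# Hypothetical sample with a no-break space hidden between two newlines.
sample = "exalted \n\u00a0\n\n Psalm 89:17"
unusual = find_unusual_whitespace(sample)
print(unusual)
```

An empty list would suggest the text really contains only ordinary spaces and line breaks; any hit points to a character that the pattern may need to cover explicitly.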





Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-07 Thread Zheng Lin Edwin Yeo
Hi Paul,

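We have tried the suggested regex pattern as follows:

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\n\s*){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
</processor>

But we still have exactly the same problem with Examples 1, 2 and 3 below.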

Example 1: A sentence on which the above regex pattern works correctly
*Original content:*Dear Sir,  \n\n \n \n\n I am terminating
*Index content: *Dear Sir,  I am terminating

Example 2: A sentence on which the above regex pattern only partially works
(as you can see, instead of 2 <br>, there are 4 <br>)
*Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
Chu Kang Avenue 4, Singapore
*Index content: *exalted  Psalm 89:17 3 Choa
Chu Kang Avenue 4, Singapore

Example 3: A sentence on which the above regex pattern only partially works
(as you can see, instead of 2 <br>, there are 4 <br>)
*Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n
\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
at 10:07 AM
*Index content: *http://www.concordpri.moe.edu.sg/ On
Tue, Dec 18, 2018 at 10:07 AM

Any further suggestions?

Thank you.

Regards,
Edwin



Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-07 Thread paul.dodd
To avoid the «\n+\s*» matching too many \n and then failing on the {2,} part 
you could try



(\n\s*){2,}



If you also want to match CRLF then

(\r?\n\s*){2,}
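
The difference between the two patterns can be seen on text with Windows line endings. This is a minimal Python sketch with a made-up sample; Python's re module is used as a stand-in for Java's regex engine, on the assumption that both treat these patterns the same way.

```python
import re

lf_only = re.compile(r"(\n\s*){2,}")
crlf_aware = re.compile(r"(\r?\n\s*){2,}")

# Hypothetical sample with CRLF line endings and one space between them.
text = "first\r\n\r\n \r\nsecond"

lf_result = lf_only.sub("<br><br>", text)
crlf_result = crlf_aware.sub("<br><br>", text)

# repr() makes any leftover carriage return visible.
print(repr(lf_result))
print(repr(crlf_result))
```

On this sample both patterns collapse the run into one replacement, since \s also matches \r, but the LF-only pattern starts matching at the first \n and so leaves the preceding \r in the text.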







Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-07 Thread Zheng Lin Edwin Yeo
Hi Paul,

Thanks for your reply.

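When I use this pattern:

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\n+\s*){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
</processor>

It works for some sentences within the same content but not for
others. Please see below for one sentence where it works and two
where it only partially works: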

Example 1: A sentence on which the above regex pattern works correctly
*Original content:*Dear Sir,  \n\n \n \n\n I am terminating
*Index content: *Dear Sir,  I am terminating

Example 2: A sentence on which the above regex pattern only partially works
(as you can see, instead of 2 <br>, there are 4 <br>)
*Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
Chu Kang Avenue 4, Singapore
*Index content: *exalted  Psalm 89:17 3 Choa
Chu Kang Avenue 4, Singapore

Example 3: A sentence on which the above regex pattern only partially works
(as you can see, instead of 2 <br>, there are 4 <br>)
*Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n
\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
at 10:07 AM
*Index content: *http://www.concordpri.moe.edu.sg/ On
Tue, Dec 18, 2018 at 10:07 AM
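
As a quick check outside Solr, the pattern can be run against the Example 1 text, both with and without the + after \n. This is only a sketch in Python, whose re module is assumed to behave like Java's engine for these patterns, and it assumes the text contains nothing but plain spaces and \n.

```python
import re

# Example 1 text as written above, assuming plain spaces and \n only.
example_1 = "Dear Sir,  \n\n \n \n\n I am terminating"

with_plus = re.compile(r"(\n+\s*){2,}")
without_plus = re.compile(r"(\n\s*){2,}")

result_plus = with_plus.sub("<br><br>", example_1)
result_plain = without_plus.sub("<br><br>", example_1)

print(result_plus)
print(result_plus == result_plain)
```

Under these assumptions both forms match the whole run of line breaks once, so differences seen in Solr are more likely to come from the data than from the + in the pattern.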

We would appreciate your help in finding out what is wrong.

Thank you.

Regards,
Edwin



Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-07 Thread paul.dodd
You don’t say what happens, just that it is not working. I assume nothing is 
replaced? Perhaps the pattern should be



   "(\n\s*){2,}"



??





RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-07 Thread Zheng Lin Edwin Yeo
Hi,

I am trying to use the RegexReplaceProcessorFactory to match two or more
\n with any number of spaces between them (e.g. \n\n, \n \n, \n \n  \n \n),
and replace them with two <br>.

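I use the following regex pattern, and it works when I test it on
regex101.com. But it does not work when I put it inside the
RegexReplaceProcessorFactory as below:

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">"(\\n\s*){2,}"</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
</processor>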

To explain my regex pattern further: \s* tells the regex to match any
whitespace that follows a \n, and {2,} tells it to match 2 or more
occurrences of that group (a \n with its trailing whitespace).
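
The intended behaviour can be sketched outside Solr as follows. This is a Python illustration with a made-up sample, on the assumption that Python's re module and Java's regex engine agree on these patterns. It also shows what may happen if the surrounding quotes and the doubled backslash reach the regex engine unchanged: the pattern then looks for literal quote characters and a backslash followed by the letter n, not for a newline.

```python
import re

# Made-up sample: newlines separated by varying numbers of plain spaces.
text = "line one\n \n  \n \nline two"

# Intended pattern: a newline followed by optional whitespace, two or more times.
intended = re.compile(r"(\n\s*){2,}")

# The configured value taken literally: the double quotes are ordinary
# characters, and \\n matches a backslash followed by the letter n.
literal = re.compile(r'"(\\n\s*){2,}"')

collapsed = intended.sub("<br><br>", text)
unchanged = literal.sub("<br><br>", text)

print(collapsed)          # the whole run becomes a single <br><br>
print(unchanged == text)  # the literal form finds nothing to replace
```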

Please kindly let me know what is wrong and how I should do it.

I am using Solr 7.6.0.

Regards,
Edwin