Re: parenthesised regular expressions and non-greedy operator ? - non standard bash behaviour

2017-12-04 Thread H.-Dirk Schmitt
On Mon, 2017-12-04 at 16:49 -0500, Chet Ramey wrote:
> The thing is, bash doesn't "implement" its regular expressions, per se.
> Bash uses the Posix standard library functions (regcomp/regexec) if they
> are available in the C library when it's configured and built.  I'm not
> wild about adding a dependency on pcre, or a configure test for it, just
> to have two varieties of regular expressions available.
> 
> Chet

OK, so please close this as "not a bug".


-- 
H.-Dirk Schmitt
Dipl.Math.
eMail: dirk.schm...@computer42.org
mobile: +49 177 616 8564
phone: +49 2642 99 41 14
fax: +49 2642 99 41 15
Schillerstr. 42, D-53489 Sinzig
pgp: http://www.computer42.org/~dirk/OpenPGP-fingerprint.html



Re: parenthesised regular expressions and non-greedy operator ? - non standard bash behaviour

2017-12-04 Thread Chet Ramey
On 12/4/17 1:42 PM, H.-Dirk Schmitt wrote:
> From the 2 replies I understand that the implementation in bash is
> correct according to the "official" standard.
> 
> For myself I have solved the issue in my script, but without the
> 'non-greedy' operator the regular expressions I developed for my
> problem are more difficult to read and maintain. From that point of
> view it would be an improvement for bash to implement the non-greedy
> operator.

The thing is, bash doesn't "implement" its regular expressions, per se.
Bash uses the Posix standard library functions (regcomp/regexec) if they
are available in the C library when it's configured and built.  I'm not
wild about adding a dependency on pcre, or a configure test for it, just
to have two varieties of regular expressions available.
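
As a minimal sketch of what this means in practice (the sample string and
variable names are made up for illustration, they are not from the report):
in an ERE every quantifier is greedy, so the first group takes as much as it
can while still letting the rest of the pattern match.

```shell
#!/usr/bin/env bash
# Hypothetical sample string, chosen only to show greedy matching.
s='key=a=b&rest'

# (.*) is greedy: the first group runs up to the LAST '=' that still
# allows the remainder of the pattern ("=...&") to match.
re='^(.*)=(.*)&'

if [[ $s =~ $re ]]; then
    greedy_first=${BASH_REMATCH[1]}
    greedy_second=${BASH_REMATCH[2]}
    printf '1 -> %s\n' "$greedy_first"    # prints: 1 -> key=a
    printf '2 -> %s\n' "$greedy_second"   # prints: 2 -> b
fi
```

A Perl-style `(.*?)` would have stopped at the first '=' and captured only
"key"; the ERE grammar has no equivalent.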

Chet
-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/



Re: parenthesised regular expressions and non-greedy operator ? - non standard bash behaviour

2017-12-04 Thread H.-Dirk Schmitt
From the 2 replies I understand that the implementation in bash is
correct according to the "official" standard.

For myself I have solved the issue in my script, but without the
'non-greedy' operator the regular expressions I developed for my
problem are more difficult to read and maintain. From that point of
view it would be an improvement for bash to implement the non-greedy
operator.
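
For readers looking for an alternative that avoids regular expressions
altogether, here is a rough sketch using parameter expansion. It is not
necessarily what was done in the script mentioned above. The '&x=details'
tail of the URL and all variable names are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical input: host and path follow the report, but the
# '&x=details' tail is an assumption made for this example.
url='http://toolbox.contentspread.net/container/medimops/track/xx.dyn?csRdu=https://www.medimops.de/?anid=M99&x=details'

# Drop the scheme and host ('#' removes the shortest matching prefix).
rest=${url#*://toolbox.contentspread.net/}

# Everything before the FIRST '=' ('%%' removes the longest matching suffix).
before_eq=${rest%%=*}

# Everything after the first '=', then cut at the first '&'.
after_eq=${rest#*=}
target=${after_eq%%'&'*}

printf '1 -> %s\n' "$before_eq"   # prints: 1 -> container/medimops/track/xx.dyn?csRdu
printf '2 -> %s\n' "$target"      # prints: 2 -> https://www.medimops.de/?anid=M99
```

The `#`/`%%` operators let each step say explicitly whether the shortest or
longest match is wanted, which is the distinction the non-greedy operator
would otherwise express.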

Also, from the point of view of a "normal developer", I think this is a
common pitfall, because many testing resources and regexp implementations
support the 'non-greedy' operator.

Maybe a switch/option to enable the 'non-greedy' operator could be added
in a future release.

So please feel free to change the "bug report" to a "feature request"
;-) 

Best Regards,

H.-Dirk Schmitt



On Sun, 2017-12-03 at 15:23 -0500, Chet Ramey wrote:
> On 12/1/17 12:40 PM, d...@computer42.org wrote:
> 
> > Bash Version: 4.4
> > Patch Level: 12
> > Release Status: release
> > 
> > Description:
> >   I'm sanitising urls from advertisement crap. As described below
> > I'm getting a wrong resolution of parenthesised expression defined
> > with non-greedy operator '?'.
> > 
> >   The test url is:
> > http://toolbox.contentspread.net/container/medimops/track/xx.dyn?csRdu=https://www.medimops.de/?anid=M99=details=M99_source=CRM_medium=email_campaign=OS
> > 
> >   The regular expression is:
> > https?:\/\/toolbox.contentspread.net\/(.*?)=(.+?)&.*
> 
> The Bash =~ operator uses Posix extended regexps (EREs) as defined in
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04.
> There's no concept of a `non-greedy' operator
> in the Posix ERE definition.
> 
> Chet
> 



Re: parenthesised regular expressions and non-greedy operator ? - non standard bash behaviour

2017-12-03 Thread Chet Ramey
On 12/1/17 12:40 PM, d...@computer42.org wrote:

> Bash Version: 4.4
> Patch Level: 12
> Release Status: release
> 
> Description:
>   I'm sanitising urls from advertisement crap. As described below I'm getting 
> a wrong resolution of parenthesised expression defined with non-greedy 
> operator '?'.
> 
>   The test url is: 
> http://toolbox.contentspread.net/container/medimops/track/xx.dyn?csRdu=https://www.medimops.de/?anid=M99=details=M99_source=CRM_medium=email_campaign=OS
> 
>   The regular expression is: 
> https?:\/\/toolbox.contentspread.net\/(.*?)=(.+?)&.*

The Bash =~ operator uses Posix extended regexps (EREs) as defined in
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04.
There's no concept of a `non-greedy' operator
in the Posix ERE definition.

Chet

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/



Re: parenthesised regular expressions and non-greedy operator ? - non standard bash behaviour

2017-12-01 Thread Greg Wooledge
On Fri, Dec 01, 2017 at 06:40:35PM +0100, d...@computer42.org wrote:
>   I'm sanitising urls from advertisement crap. As described below I'm getting 
> a wrong resolution of parenthesised expression defined with non-greedy 
> operator '?'.

> re='https?:\/\/toolbox.contentspread.net\/(.*?)=(.+?)&.*'
> 
> if [[ ${url} =~ ${re} ]]

Bash's =~ operator uses Extended Regular Expressions.  There is no
non-greedy operator (.*? or .+?) in an ERE.  It's a perl extension.

Also, you don't need to escape / but you *do* need to escape dots.
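
A common ERE workaround, sketched here under stated assumptions, is to
replace each non-greedy group with a negated bracket expression that cannot
run past the delimiter. The '&x=details' tail of the URL is an assumption
(the URL in the report appears to have lost its '&' separators), and the
variable names are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical input: host and path follow the report, but the
# '&x=details' tail is an assumption made for this example.
url='http://toolbox.contentspread.net/container/medimops/track/xx.dyn?csRdu=https://www.medimops.de/?anid=M99&x=details'

# [^=]* cannot cross the first '=' and [^&]+ cannot cross the first '&',
# which gives the effect of (.*?)= and (.+?)& in plain ERE syntax.
# Dots in the host name are escaped; slashes need no escaping.
re='^https?://toolbox\.contentspread\.net/([^=]*)=([^&]+)&'

if [[ $url =~ $re ]]; then
    first=${BASH_REMATCH[1]}
    second=${BASH_REMATCH[2]}
    printf '1 -> %s\n' "$first"    # prints: 1 -> container/medimops/track/xx.dyn?csRdu
    printf '2 -> %s\n' "$second"   # prints: 2 -> https://www.medimops.de/?anid=M99
fi
```

This only works when the delimiter is a single character; a multi-character
delimiter has no equally simple ERE equivalent.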



parenthesised regular expressions and non-greedy operator ? - non standard bash behaviour

2017-12-01 Thread dirk
Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: linux-gnu
Compiler: gcc
Compilation CFLAGS:  -DPROGRAM='bash' -DCONF_HOSTTYPE='x86_64' 
-DCONF_OSTYPE='linux-gnu' -DCONF_MACHTYPE='x86_64-pc-linux-gnu' 
-DCONF_VENDOR='pc' -DLOCALEDIR='/usr/share/locale' -DPACKAGE='bash' -DSHELL 
-DHAVE_CONFIG_H   -I.  -I../. -I.././include -I.././lib  -Wdate-time 
-D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector-strong -Wformat 
-Werror=format-security -Wall -no-pie -Wno-parentheses -Wno-format-security
uname output: Linux dilbert 4.10.0-41-generic #45~16.04.1-Ubuntu SMP Fri Nov 24 
15:06:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Machine Type: x86_64-pc-linux-gnu

Bash Version: 4.4
Patch Level: 12
Release Status: release

Description:
  I'm sanitising urls from advertisement crap. As described below I'm getting a 
wrong resolution of parenthesised expression defined with non-greedy operator 
'?'.

  The test url is: 
http://toolbox.contentspread.net/container/medimops/track/xx.dyn?csRdu=https://www.medimops.de/?anid=M99=details=M99_source=CRM_medium=email_campaign=OS

  The regular expression is: 
https?:\/\/toolbox.contentspread.net\/(.*?)=(.+?)&.*

  As I understand the specification and verified with 'visual regexp' and 
https://regex101.com/ the result should be:

1 →  container/medimops/track/xx.dyn?csRdu
2 → https://www.medimops.de/?anid=M99

  Running the script below I got instead:
1 → container/medimops/track/xx.dyn?csRdu=https://www.medimops.de/?anid=M99=details=M99_source=CRM_medium
2 → email


Repeat-By:

  Test script:
#!/bin/bash

url='http://toolbox.contentspread.net/container/medimops/track/xx.dyn?csRdu=https://www.medimops.de/?anid=M99=details=M99_source=CRM_medium=email_campaign=OS'
re='https?:\/\/toolbox.contentspread.net\/(.*?)=(.+?)&.*'

if [[ ${url} =~ ${re} ]]
then
    echo "0 → ${BASH_REMATCH[0]}"
    echo "1 → ${BASH_REMATCH[1]}"
    echo "2 → ${BASH_REMATCH[2]}"
fi