Re: [ast-developers] rfe: .sh.regex.available_character_class array?

2012-10-26 Thread Cedric Blancher
On 22 October 2012 18:12, Roland Mainz roland.ma...@nrubsig.org wrote:
 On Mon, Oct 22, 2012 at 6:14 AM, Glenn Fowler g...@research.att.com wrote:
 [snip]
 ah but you may have been thinking getconf function and not getconf command
 in that case doing it with the getconf function is probably the way to go

 1. Erm... I think you were right that locale(1) would be a better
 place... but this would mean creating yet another builtin. The question
 is... would you be OK with another one, this time to intercept
 /usr/bin/locale and add new options that return the valid values for
 |wctype()| and |wctrans()|?

 2. Below is some prototype code to do the enumeration... does it
 (generally) look OK for use in locale(1) (or getconf(1))?
 -- snip --
 [snip]
 -- snip --

Re: [ast-developers] rfe: .sh.regex.available_character_class array?

2012-10-26 Thread Roland Mainz
On Fri, Oct 26, 2012 at 11:54 AM, Cedric Blancher
cedric.blanc...@googlemail.com wrote:
 On 22 October 2012 18:12, Roland Mainz roland.ma...@nrubsig.org wrote:
 On Mon, Oct 22, 2012 at 6:14 AM, Glenn Fowler g...@research.att.com wrote:
 [snip]
 ah but you may have been thinking getconf function and not getconf command
 in that case doing it with the getconf function is probably the way to go

 1. Erm... I think you were right that locale(1) would be a better
 place... but this would mean creating yet another builtin. The question
 is... would you be OK with another one, this time to intercept
 /usr/bin/locale and add new options that return the valid values for
 |wctype()| and |wctrans()|?

 2. Below is some prototype code to do the enumeration... does it
 (generally) look OK for use in locale(1) (or getconf(1))?
 -- snip --
[snip]
 -- snip --

 Roland, thanks for the test code.

 There are two more wctrans() transformations you didn't list:
 1. to_outpunct: This maps the ASCII decimal point and thousands
 separator to their equivalents in the locale. It is defined for
 locales which use an extra decimal point and thousands separator.
 (LC_ALL=fa_IR ~/bin/ksh -c 'typeset -M to_outpunct x ; x=. ; print "|$x|"')
 |٫|
 (LC_ALL=fa_IR ~/bin/ksh -c 'typeset -M to_outpunct x ; x=, ; print "|$x|"')
 |٬|

 2. to_inpunct: This maps ASCII digits to their equivalents in the
 locale. It is defined for locales which use an extra digit set.
 (LC_ALL=fa_IR ~/bin/ksh -c 'typeset -M to_inpunct x ; x=. ; print "|$x|"')
 |٫|
 (LC_ALL=fa_IR ~/bin/ksh -c 'typeset -M to_inpunct x ; x=, ; print "|$x|"')
 |٬|

 Example application: Map ASCII numbers to their Arabic counterparts:
 (LC_ALL=fa_IR ~/bin/ksh -c 'typeset -M to_inpunct m ; for ((i=1 ; i <
 16384 ; i++ )) ; do p=$(printf "\u[$(printf "%x" i)]") ; m="$p" ; [[
 "$p" == "$m" ]] || printf "%q != %q, %d\n" "$p" "$m" i ; done')
 , != $'\u[66c]', 44
 . != $'\u[66b]', 46
 0 != ۰, 48
 1 != ۱, 49
 2 != ۲, 50
 3 != ۳, 51
 4 != ۴, 52
 5 != ۵, 53
 6 != ۶, 54
 7 != ۷, 55
 8 != ۸, 56
 9 != ۹, 57

Thanks :-)
... and (as another native Japanese speaker pointed out... I forgot
the jspace character class... ;-( ) ...

... attached (as wcsupportedlists002.c.txt) is an updated version of
the enumeration code which fixes the issues reported.



Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.ma...@nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <string.h>
#include <wctype.h>
#include <locale.h>

/*
 * List of character classes. Roughly sorted by the places
 * where we found the strings
 */
static
const char *character_classes[] =
{
    /* these are the classes mandated by POSIX */
    "alnum", "alpha", "blank", "cntrl", "digit", "graph",
    "lower", "print", "punct", "space", "upper", "xdigit",
    /*
     * these are the classes sampled from various locales on
     * Solaris, FreeBSD and Apple OSX
     */
    "english", "gb", "ideogram", "jalpha", "jdigit", "jspace",
    "jgen", "jgreek", "jhankana", "jhira", "jisx0201r", "jisx0208",
    "jisx0212", "jkanji", "jkata", "jparen", "jpunct", "jrussian",
    "jsci", "jspecial", "junit", "line", "number", "phonogram",
    "special",
    /* wchar0-24 are used by IBM's/OpenGroup's libc_i18n code */
    "wchar0", "wchar1", "wchar2", "wchar3", "wchar4",
    "wchar5", "wchar6", "wchar7", "wchar8", "wchar9",
    "wchar10", "wchar11", "wchar12", "wchar13", "wchar14",
    "wchar15", "wchar16", "wchar17", "wchar18", "wchar19",
    "wchar20", "wchar21", "wchar22", "wchar23", "wchar24",
};

static
const char *wc_transformations[]={
    "tolower", "toupper", "toascii", "tojhira",
    "tojisx0201", "tojisx0208", "tojkata", "totitle",
    "to_inpunct", "to_outpunct"
};

#define elementsof(x) (sizeof(x)/sizeof((x)[0]))

static
const char *get_list_of_supported_wctypes(void)
{
    int i;
    bool matched[elementsof(character_classes)+1];
    size_t size = 0UL;
    const char *cl;
    char *s, *p;
    char buff[128];

    for (i=0 ; i < elementsof(character_classes) ; i++)
    {
        cl=character_classes[i];

        /*
         * Some old Unixes like old Solaris have some classes
         * _accidentally_ prefixed with "is" (this happens on
         * other Unixes, too - because the matching data
         * have been both written by the same contractors
         * and/or cross-licensed between different
         * companies).

Re: [ast-developers] rfe: .sh.regex.available_character_class array?

2012-10-23 Thread Cedric Blancher
On 22 October 2012 19:44, Roland Mainz roland.ma...@nrubsig.org wrote:
 On Fri, Oct 19, 2012 at 3:38 PM, Cedric Blancher
 cedric.blanc...@googlemail.com wrote:
 Request for enhancement: .sh.regex.available_character_class

 What do you think about adding a  .sh.regex.available_character_class
 array variable which contains the list of available wctype character
 classes for the current locale? I know there is no API to get a list
 from the OS but libast could probe well-known names and put only those
 in the array for which wctype() returned a non-zero value.

 Erm... just curious: What is the usage scenario for such a feature ?

We build regular expressions dynamically, based on other input data.
The extra character classes help a lot when processing Japanese texts
because they make the regular expressions MUCH shorter, usually by
dozens of sub-expressions. The problem is that a lot of platforms
(Linux!!) sometimes lack the extra classes we have in Solaris or AIX,
which severely cripples pattern matching performance.

Ced
-- 
Cedric Blancher cedric.blanc...@googlemail.com
Institute Pasteur
___
ast-developers mailing list
ast-developers@research.att.com
https://mailman.research.att.com/mailman/listinfo/ast-developers


Re: [ast-developers] rfe: .sh.regex.available_character_class array?

2012-10-22 Thread Roland Mainz
On Mon, Oct 22, 2012 at 6:14 AM, Glenn Fowler g...@research.att.com wrote:
[snip]
 ah but you may have been thinking getconf function and not getconf command
 in that case doing it with the getconf function is probably the way to go

1. Erm... I think you were right that locale(1) would be a better
place... but this would mean creating yet another builtin. The question
is... would you be OK with another one, this time to intercept
/usr/bin/locale and add new options that return the valid values for
|wctype()| and |wctrans()|?

2. Below is some prototype code to do the enumeration... does it
(generally) look OK for use in locale(1) (or getconf(1))?
-- snip --
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <string.h>
#include <wctype.h>
#include <locale.h>

const char *character_classes[] =
{
    /* these are the classes mandated by POSIX */
    "alnum", "alpha", "blank", "cntrl", "digit", "graph",
    "lower", "print", "punct", "space", "upper", "xdigit",
    /*
     * these are the classes sampled from various locales on
     * Solaris, FreeBSD and Apple OSX
     */
    "english", "gb", "ideogram", "jalpha", "jdigit", "jgen",
    "jgreek", "jhankana", "jhira", "jisx0201r", "jisx0208", "jisx0212",
    "jkanji", "jkata", "jparen", "jpunct", "jrussian", "jsci",
    "jspecial", "junit", "line", "number", "phonogram", "special",
    "wchar0", "wchar1", "wchar2", "wchar3", "wchar4",
    "wchar5", "wchar6", "wchar7", "wchar8", "wchar9",
    "wchar10", "wchar11", "wchar12", "wchar13", "wchar14",
    "wchar15", "wchar16", "wchar17", "wchar18", "wchar19",
    "wchar20", "wchar21", "wchar22", "wchar23", "wchar24",
};

const char *wc_transformations[]={
    "tolower", "toupper", "toascii", "tojhira",
    "tojisx0201", "tojisx0208", "tojkata", "totitle",
};

#define elementsof(x) (sizeof(x)/sizeof(x[0]))

static
const char *get_list_of_supported_wctypes(void)
{
    int i;
    bool matched[elementsof(character_classes)+1];
    size_t size = 0UL;
    const char *cl;
    char *s, *p;
    char buff[128];

    for (i=0 ; i < elementsof(character_classes) ; i++)
    {
        cl=character_classes[i];

        /*
         * Some old Unixes like old Solaris have some classes
         * _accidentally_ prefixed with "is" (this happens on
         * other Unixes, too - because the matching data
         * have been both written by the same contractors
         * and/or cross-licensed between different
         * companies).
         * We work-around the issue here by testing both
         * the plain and intended name.
         */
        buff[0]='i';
        buff[1]='s';
        strcpy(&buff[2], cl);

        if (wctype(cl) || wctype(buff))
        {
            size+=strlen(cl)+2;
            matched[i]=true;
        }
        else
        {
            matched[i]=false;
        }
    }

    s=p=malloc(size+1);
    if (!s)
    {
        perror("malloc() failed.");
        return (NULL);
    }
    /* keep the result a valid (empty) string if nothing matched */
    *p='\0';

    for (i=0 ; i < elementsof(character_classes) ; i++)
    {
        if (matched[i])
        {
            p=stpcpy(p, character_classes[i]);
            *p=' ';
            *++p='\0';
        }
    }

    /* strip the trailing blank; the p > s test guards the empty list */
    if (p > s && *--p==' ')
        *p='\0';

    return (s);
}

static
const char *get_list_of_supported_wctransformations(void)
{
    int i;
    bool matched[elementsof(wc_transformations)+1];
    size_t size = 0UL;
    const char *tr;
    char *s, *p;

    for (i=0 ; i < elementsof(wc_transformations) ; i++)
    {
        tr=wc_transformations[i];

        if (wctrans(tr))
        {
            size+=strlen(tr)+2;
            matched[i]=true;
        }
        else
        {
            matched[i]=false;
        }
    }

    s=p=malloc(size+1);
    if (!s)
    {
        perror("malloc() failed.");
        return (NULL);
    }


    for (i=0 ; i < elementsof(wc_transformations) ; i++)
    {
        if (matched[i])
        {
            p=stpcpy(p, wc_transformations[i]);
            *p=' ';
 

Re: [ast-developers] rfe: .sh.regex.available_character_class array?

2012-10-22 Thread Roland Mainz
On Fri, Oct 19, 2012 at 3:38 PM, Cedric Blancher
cedric.blanc...@googlemail.com wrote:
 Request for enhancement: .sh.regex.available_character_class

 What do you think about adding a  .sh.regex.available_character_class
 array variable which contains the list of available wctype character
 classes for the current locale? I know there is no API to get a list
 from the OS but libast could probe well-known names and put only those
 in the array for which wctype() returned a non-zero value.

Erm... just curious: What is the usage scenario for such a feature ?



Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.ma...@nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)


Re: [ast-developers] rfe: .sh.regex.available_character_class array?

2012-10-21 Thread Roland Mainz
On Fri, Oct 19, 2012 at 3:38 PM, Cedric Blancher
cedric.blanc...@googlemail.com wrote:
 Request for enhancement: .sh.regex.available_character_class

 What do you think about adding a  .sh.regex.available_character_class
 array variable which contains the list of available wctype character
 classes for the current locale? I know there is no API to get a list
 from the OS but libast could probe well-known names and put only those
 in the array for which wctype() returned a non-zero value.

IMO it's better to let getconf handle that job because these are
locale properties which are not limited to the shell.
AFAIK we need two different getconf properties - one for regex
character classes and one for |wctrans()| transformations.
I did some digging... and it seems Solaris 11 supports the following
transformations (beyond POSIX; these are locale-dependent):
-- snip --
tojhira
tojisx0201
tojisx0208
tojkata
tolower
toupper
-- snip --
... Linux adds totitle.

Character classes (beyond POSIX; these are locale-dependent)
supported by Solaris 11 are:
-- snip --
english
gb
ideogram
jalpha
jdigit
jgen
jgreek
jhankana
jhira
jisx0201r
jisx0208
jisx0212
jkanji
jkata
jparen
jpunct
jrussian
jsci
jspecial
junit
line
number
phonogram
special
wchar10
wchar11
wchar12
wchar13
wchar14
wchar15
wchar16
wchar17
wchar18
wchar19
wchar20
wchar21
wchar22
wchar23
wchar24
wchar6
wchar9
-- snip --
(note that some of these are erroneously prefixed with "is" in some
older Solaris versions). FreeBSD/OSX and Illumos add "rune" as an extra
class here.

Glenn: What do you think about the idea of using getconf for this?
If you think this is OK then I can provide code that tests these
well-known names (erm... including the "is" prefix for character
classes) for both (note that we cannot cache the values because they
depend on LANG/LC_CTYPE/LC_ALL and IMO it's cheaper to probe the
values each time getconf is called than to add more code for
caching and tracking the values of LANG/LC_CTYPE/LC_ALL).



Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.ma...@nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)


Re: [ast-developers] rfe: .sh.regex.available_character_class array?

2012-10-21 Thread Glenn Fowler

ah but you may have been thinking getconf function and not getconf command
in that case doing it with the getconf function is probably the way to go

On Mon, 22 Oct 2012 00:10:28 -0400 Glenn Fowler wrote:
 locale(1) would be my first choice
 but getconf(1) would be ok too

 On Mon, 22 Oct 2012 01:34:38 +0200 Roland Mainz wrote:
  On Fri, Oct 19, 2012 at 3:38 PM, Cedric Blancher
  cedric.blanc...@googlemail.com wrote:
   Request for enhancement: .sh.regex.available_character_class
  
   What do you think about adding a  .sh.regex.available_character_class
   array variable which contains the list of available wctype character
   classes for the current locale? I know there is no API to get a list
   from the OS but libast could probe well-known names and put only those
   in the array for which wctype() returned a non-zero value.

  IMO it's better to let getconf handle that job because these are
  locale properties which are not limited to the shell.
  AFAIK we need two different getconf properties - one for regex
  character classes and one for |wctrans()| transformations.
  I did some digging... and it seems Solaris 11 supports the following
  transformations (beyond POSIX; these are locale-dependent):
  -- snip --
  tojhira
  tojisx0201
  tojisx0208
  tojkata
  tolower
  toupper
  -- snip --
  ... Linux adds totitle.

  Character classes (beyond POSIX; these are locale-dependent)
  supported by Solaris 11 are:
  -- snip --
  english
  gb
  ideogram
  jalpha
  jdigit
  jgen
  jgreek
  jhankana
  jhira
  jisx0201r
  jisx0208
  jisx0212
  jkanji
  jkata
  jparen
  jpunct
  jrussian
  jsci
  jspecial
  junit
  line
  number
  phonogram
  special
  wchar10
  wchar11
  wchar12
  wchar13
  wchar14
  wchar15
  wchar16
  wchar17
  wchar18
  wchar19
  wchar20
  wchar21
  wchar22
  wchar23
  wchar24
  wchar6
  wchar9
  -- snip --
  (note that some of these are erroneously prefixed with "is" in some
  older Solaris versions). FreeBSD/OSX and Illumos add "rune" as an extra
  class here.

  Glenn: What do you think about the idea of using getconf for this?
  If you think this is OK then I can provide code that tests these
  well-known names (erm... including the "is" prefix for character
  classes) for both (note that we cannot cache the values because they
  depend on LANG/LC_CTYPE/LC_ALL and IMO it's cheaper to probe the
  values each time getconf is called than to add more code for
  caching and tracking the values of LANG/LC_CTYPE/LC_ALL).

  

  Bye,
  Roland

  -- 
    __ .  . __
   (o.\ \/ /.o) roland.ma...@nrubsig.org
    \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
    /O /==\ O\  TEL +49 641 3992797
   (;O/ \/ \O;)
