Re: [CFT] Unicode collation string and reworked locale definitions

2015-11-03 Thread Pedro Giffuni
Hello;

> On 3 Nov 2015, at 10:52, Wolfgang Jenkner wrote:
> 
> On Tue, Nov 03 2015, Pedro Giffuni wrote:
> 
>> What worries me about libtre is that it lacks important functionality like
>> word delimiters. We even brought the sysv delimiters to be more compatible
>> with Solaris and GNU and we can’t back those out now:
>> 
>> https://svnweb.freebsd.org/base?view=revision&revision=268066
> 
> It supports \< and \> out of the box, cf.
> 
> https://github.com/laurikari/tre/blob/master/doc/tre-syntax.html
> 
> And the darwin patch mentioned above implements [[:<:]] and [[:>:]], see
> 
> http://www.opensource.apple.com/source/Libc/Libc-1044.40.1/regex/TRE/lib/tre-parse.c
> 
> That patch also implements the REG_STARTEND flag for regexec(3), which
> is needed for vi.
> 
> Also, tre provides wchar versions for regcomp(3) and friends, so that
> nvi wouldn't need its own private regex library anymore.

Interesting, thanks.

I only looked at it briefly, long ago. I noticed there was a big TODO
list and that the Apple patches were partially copyleft (APSL), so I
didn’t dig into it too much.

It certainly has to be evaluated.

Pedro.

___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"

Re: [CFT] Unicode collation string and reworked locale definitions

2015-11-03 Thread Wolfgang Jenkner
On Tue, Nov 03 2015, Pedro Giffuni wrote:

> What worries me about libtre is that it lacks important functionality like
> word delimiters. We even brought the sysv delimiters to be more compatible
> with Solaris and GNU and we can’t back those out now:
>
> https://svnweb.freebsd.org/base?view=revision&revision=268066

It supports \< and \> out of the box, cf.

https://github.com/laurikari/tre/blob/master/doc/tre-syntax.html

And the darwin patch mentioned above implements [[:<:]] and [[:>:]], see

http://www.opensource.apple.com/source/Libc/Libc-1044.40.1/regex/TRE/lib/tre-parse.c
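Whether a given regex(3) accepts either spelling of the word delimiters can be checked empirically. The following is a minimal sketch, not taken from the thread: the helper name probe_match is hypothetical, and it uses only the standard regcomp(3)/regexec(3) calls. It assumes that an implementation without the syntax either rejects the pattern at compile time or fails to match.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if `pattern` compiles as a POSIX basic
 * regex and matches somewhere in `text`, 0 if it compiles but does not
 * match, and -1 if regcomp(3) rejects it (for example because this libc
 * does not know the word-delimiter syntax used in the pattern).
 */
static int
probe_match(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, REG_NOSUB) != 0)
		return (-1);
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return (rc == 0 ? 1 : 0);
}
```

Calling probe_match("\\<cat\\>", "a cat sat") and probe_match("[[:<:]]cat[[:>:]]", "a cat sat") on the libc under test shows which of the two delimiter styles it understands; the result is expected to differ between implementations.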

That patch also implements the REG_STARTEND flag for regexec(3), which
is needed for vi.
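For readers unfamiliar with REG_STARTEND: it is a non-POSIX extension in which the caller pre-loads pmatch[0] with the byte range to search, so the subject does not have to be NUL-terminated at the end of that range. A rough sketch follows, assuming the platform's <regex.h> defines REG_STARTEND; the helper name search_range is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: search for `pattern` (basic regex) inside the
 * byte range [start, end) of `buf`. On success return the offset of the
 * match, measured from the beginning of `buf`. Return -1 on no match,
 * on a bad pattern, or when REG_STARTEND is not available.
 */
static long
search_range(const char *pattern, const char *buf, size_t start, size_t end)
{
#ifdef REG_STARTEND
	regex_t re;
	regmatch_t m[1];
	long pos = -1;

	if (regcomp(&re, pattern, 0) != 0)
		return (-1);
	/* With REG_STARTEND, pmatch[0] is input as well as output. */
	m[0].rm_so = (regoff_t)start;
	m[0].rm_eo = (regoff_t)end;
	if (regexec(&re, buf, 1, m, REG_STARTEND) == 0)
		pos = (long)m[0].rm_so;
	regfree(&re);
	return (pos);
#else
	(void)pattern; (void)buf; (void)start; (void)end;
	return (-1);
#endif
}
```

An editor can use this style of call to search part of a larger buffer in place, without copying the range out and terminating it first.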

Also, tre provides wchar versions for regcomp(3) and friends, so that
nvi wouldn't need its own private regex library anymore.

Re: [CFT] Unicode collation string and reworked locale definitions

2015-11-03 Thread Pedro Giffuni
Hi Baptiste;

> On 3 Nov 2015, at 02:17, Baptiste Daroussin wrote:
> 
> On Mon, Nov 02, 2015 at 06:59:15PM -0500, Pedro Giffuni wrote:
>> First of all, congratulations to Baptiste and Marino for succeeding where
>> I failed many moons ago. Also huge thanks to Nexenta and Garrett D’Amore
>> for relicensing localedef for us.
>> 
>> Concerning regex;
>> 
>> Gabor@ did a lot of work on libtre but according to him it was not up to the
>> task performancewise. We would also lose features if we move to libtre.
>> 
>> I think our regex code actually has most of what is needed for multibyte
>> already. I have a patch that turns on the functionality but I haven’t found
>> any brave soul that will do the testing:
>> 
>> https://people.freebsd.org/~pfg/patches/regex-multibyte.diff
>> 
> I think this can be tested once the collation branch is merged.

Absolutely: support for collation is critical and badly needed even without
resolving the regex issues.

> Note that
> dragonfly and musl libc both use a patched version of libtre for the regex
> implementation.
> 

I am aware. Also note that Gabor had some patches too, in order to make
it usable for bsdgrep:

https://wiki.freebsd.org/Regex

> From my non-scientific testing, libtre was more reliable and performant than
> our current regex.

According to Gabor, the general performance was better until you take into
account multibyte support where it was clearly inferior to GNU regex.

> Anyway, it will be relatively "easy" to use the AT&T test suite to compare the
> reliability and performance of both implementations: ours + your patch, and
> patched libtre.
> 


What worries me about libtre is that it lacks important functionality like word
delimiters. We even brought the sysv delimiters to be more compatible with
Solaris and GNU and we can’t back those out now:

https://svnweb.freebsd.org/base?view=revision&revision=268066

Pedro.



Re: [CFT] Unicode collation string and reworked locale definitions

2015-11-02 Thread Baptiste Daroussin
On Mon, Nov 02, 2015 at 06:59:15PM -0500, Pedro Giffuni wrote:
> First of all, congratulations to Baptiste and Marino for succeeding where
> I failed many moons ago. Also huge thanks to Nexenta and Garrett D’Amore
> for relicensing localedef for us.
> 
> Concerning regex;
> 
> Gabor@ did a lot of work on libtre but according to him it was not up to the
> task performancewise. We would also lose features if we move to libtre.
> 
> I think our regex code actually has most of what is needed for multibyte
> already. I have a patch that turns on the functionality but I haven’t found
> any brave soul that will do the testing:
> 
> https://people.freebsd.org/~pfg/patches/regex-multibyte.diff
> 
I think this can be tested once the collation branch is merged. Note that
dragonfly and musl libc both use a patched version of libtre for the regex
implementation.

From my non-scientific testing, libtre was more reliable and performant than our
current regex. Anyway, it will be relatively "easy" to use the AT&T test suite
to compare the reliability and performance of both implementations: ours + your
patch, and patched libtre.

Best regards,
Bapt




Re: [CFT] Unicode collation string and reworked locale definitions

2015-11-02 Thread Pedro Giffuni
First of all, congratulations to Baptiste and Marino for succeeding where
I failed many moons ago. Also huge thanks to Nexenta and Garrett D’Amore
for relicensing localedef for us.

Concerning regex;

Gabor@ did a lot of work on libtre but according to him it was not up to the
task performancewise. We would also lose features if we move to libtre.

I think our regex code actually has most of what is needed for multibyte
already. I have a patch that turns on the functionality but I haven’t found
any brave soul that will do the testing:

https://people.freebsd.org/~pfg/patches/regex-multibyte.diff
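For anyone willing to test, one simple check of multibyte awareness is whether "." consumes a whole character or a single byte. The sketch below is an illustration only, not part of the patch; the helper name ere_matches is hypothetical. It relies on the usual expectation that a byte-oriented matcher sees the UTF-8 encoding of "é" (bytes 0xC3 0xA9) as two units, while a multibyte-aware matcher running under a UTF-8 locale sees one.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if the extended regex `pattern` matches
 * `text` in the current locale, 0 if it does not, and -1 if the pattern
 * fails to compile.
 */
static int
ere_matches(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return (-1);
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return (rc == 0 ? 1 : 0);
}
```

In the default "C" locale, ere_matches("^..$", "\xc3\xa9") is expected to succeed and ere_matches("^.$", "\xc3\xa9") to fail. After a call such as setlocale(LC_ALL, "en_US.UTF-8") (the locale name is an assumption and must be installed), a multibyte-aware regex should give the opposite answers.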

Thanks again,

Pedro.


Re: [CFT] Unicode collation string and reworked locale definitions

2015-11-02 Thread Garrett Cooper

> On Nov 2, 2015, at 02:17, Baptiste Daroussin  wrote:
> 
>> On Mon, Nov 02, 2015 at 10:04:11AM +, David Chisnall wrote:
>>> On 1 Nov 2015, at 21:30, Baptiste Daroussin  wrote:
>>> 
>>> All reported issues have been fixed. Unless more issues are reported, this
>>> will be merged into head next Saturday, November 7th.
>> 
>> That’s really excellent news!  Thanks for doing this.  Are there any good 
>> potential sources for the regex stuff?  I think std::regex in libc++ 
>> supports multibyte character sets, but is very full of templates and not 
>> very easy to translate into C.
> For the regex tools, it will be another step. I was planning to incorporate
> libtre + Apple's patches like dragonfly did. It would need a lot of tests, but
> from my current testing, performance is better than our current
> implementation, and it makes libc's regex pass way more entries in the AT&T
> regex test suite.
> 
> If anyone else wants to work on bringing that in, I would be very glad, as I
> already have too many things on my plate :)

I was about to say... The regex tests on FreeBSD in tools/regression/lib/libc
are quite broken ;( (Bug 191354). I'd like to fix/salvage those test cases if
at all possible -- this might be a good motivator for that.
Thanks,
-NGie

Re: [CFT] Unicode collation string and reworked locale definitions

2015-11-02 Thread Baptiste Daroussin
On Mon, Nov 02, 2015 at 10:04:11AM +, David Chisnall wrote:
> On 1 Nov 2015, at 21:30, Baptiste Daroussin  wrote:
> > 
> > All reported issues have been fixed. Unless more issues are reported, this
> > will be merged into head next Saturday, November 7th.
> 
> That’s really excellent news!  Thanks for doing this.  Are there any good 
> potential sources for the regex stuff?  I think std::regex in libc++ supports 
> multibyte character sets, but is very full of templates and not very easy to 
> translate into C.
> 
For the regex tools, it will be another step. I was planning to incorporate
libtre + Apple's patches like dragonfly did. It would need a lot of tests, but
from my current testing, performance is better than our current implementation,
and it makes libc's regex pass way more entries in the AT&T regex test suite.

If anyone else wants to work on bringing that in, I would be very glad, as I
already have too many things on my plate :)

Best regards,
Bapt




Re: [CFT] Unicode collation string and reworked locale definitions

2015-11-02 Thread David Chisnall
On 1 Nov 2015, at 21:30, Baptiste Daroussin  wrote:
> 
> All reported issues have been fixed. Unless more issues are reported, this
> will be merged into head next Saturday, November 7th.

That’s really excellent news!  Thanks for doing this.  Are there any good 
potential sources for the regex stuff?  I think std::regex in libc++ supports 
multibyte character sets, but is very full of templates and not very easy to 
translate into C.

David


Re: [CFT] Unicode collation string and reworked locale definitions

2015-11-01 Thread Andreas Tobler

On 01.11.15 22:30, Baptiste Daroussin wrote:

On Wed, Oct 14, 2015 at 12:23:06AM +0200, Baptiste Daroussin wrote:

Hi all,

I have been working for a while on bringing in Unicode string collation
support by merging code from Illumos (by Garrett D'Amore who kindly made sure
his work was made under BSD license) and Dragonfly (by John Marino), and some
ancient work done on FreeBSD by edwin@ but never merged.

The result is available in the projects/collation branch.

The results of this work are:
- Locales are now generated with the new localedef(1) tool from CLDR POSIX files
- The generated files are now identified as "BSD 1.0" format
- Only "BSD 1.0" locale files are now read; any other version will be set to
   "C"
- The localedef(1) tool has been imported from Illumos and modified to use
   tree(3) instead of the CDDL avl(3)
- A set of tools created by edwin@ and extended by marino@ for dragonfly has been
   added to be able to generate locales
- Given that our regex(3) does not support multibyte yet (actually it does not
   support some single-byte codesets) it has been forced to always use locale C
- Remove colldef(1) and mklocale(1)
- Finish implementing the numeric BSD extension for ctypes
- Add a bunch of new locales: some Arabic locales, Hebrew locales, some
   regional locales, etc.
- Make a bunch of ISO-8859-1 locales simple aliases of ISO-8859-15 where it makes
   sense
- Add short versions of locales
- Add @euro aliases on the locales where that makes sense

Please test the branch and report issues.
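A quick way to exercise the new collation tables is to sort a few strings with strcoll(3) under different locales and compare the orderings. The following is a minimal sketch, not part of the branch; the helper names cmp_collate and sort_in_locale are hypothetical, and any locale name passed in (for example "fr_FR.UTF-8") is only an assumption about what is installed.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort(3) comparator that orders C strings with strcoll(3). */
static int
cmp_collate(const void *a, const void *b)
{
	const char *const *sa = a;
	const char *const *sb = b;

	return (strcoll(*sa, *sb));
}

/*
 * Hypothetical helper: sort `n` strings in place using the LC_COLLATE
 * rules of `locale_name`. Returns 0 on success, or -1 if setlocale(3)
 * cannot load that locale (the array is left untouched in that case).
 */
static int
sort_in_locale(const char *locale_name, const char **words, size_t n)
{
	if (setlocale(LC_COLLATE, locale_name) == NULL)
		return (-1);
	qsort(words, n, sizeof(words[0]), cmp_collate);
	return (0);
}
```

Under the "C" locale the result is plain byte order, so uppercase ASCII sorts before lowercase. Under a locale with real collation data the same input would be expected to come out in dictionary-like order, which is the behaviour worth comparing before and after the branch.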

Note that, yes, this means the COLLATION_FIX patch on glib2 will no longer be
necessary, and yes, the icu patch on postgresql will no longer be necessary
either.

Best regards,
Bapt


All reported issues have been fixed. Unless more issues are reported, this
will be merged into head next Saturday, November 7th.



Cool! Waiting for it!

Thanks,
Andreas



Re: [CFT] Unicode collation string and reworked locale definitions

2015-11-01 Thread Baptiste Daroussin
On Wed, Oct 14, 2015 at 12:23:06AM +0200, Baptiste Daroussin wrote:
> Hi all,
> 
> I have been working for a while on bringing in Unicode string collation
> support by merging code from Illumos (by Garrett D'Amore who kindly made sure
> his work was made under BSD license) and Dragonfly (by John Marino), and some
> ancient work done on FreeBSD by edwin@ but never merged.
> 
> The result is available in the projects/collation branch.
> 
> The results of this work are:
> - Locales are now generated with the new localedef(1) tool from CLDR POSIX
>   files
> - The generated files are now identified as "BSD 1.0" format
> - Only "BSD 1.0" locale files are now read; any other version will be set to
>   "C"
> - The localedef(1) tool has been imported from Illumos and modified to use
>   tree(3) instead of the CDDL avl(3)
> - A set of tools created by edwin@ and extended by marino@ for dragonfly has
>   been added to be able to generate locales
> - Given that our regex(3) does not support multibyte yet (actually it does not
>   support some single-byte codesets) it has been forced to always use locale C
> - Remove colldef(1) and mklocale(1)
> - Finish implementing the numeric BSD extension for ctypes
> - Add a bunch of new locales: some Arabic locales, Hebrew locales, some
>   regional locales, etc.
> - Make a bunch of ISO-8859-1 locales simple aliases of ISO-8859-15 where it
>   makes sense
> - Add short versions of locales
> - Add @euro aliases on the locales where that makes sense
> 
> Please test the branch and report issues.
> 
> Note that, yes, this means the COLLATION_FIX patch on glib2 will no longer be
> necessary, and yes, the icu patch on postgresql will no longer be necessary
> either.
> 
> Best regards,
> Bapt

All reported issues have been fixed. Unless more issues are reported, this
will be merged into head next Saturday, November 7th.

Bapt

