Re: [CFT] Unicode collation string and reworked locale definitions
Hello;

> On 3 Nov 2015, at 10:52, Wolfgang Jenkner wrote:
>
> On Tue, Nov 03 2015, Pedro Giffuni wrote:
>
>> What worries me about libtre is that it lacks important functionality like
>> word delimiters. We even brought the sysv delimiters to be more compatible
>> with Solaris and GNU and we can’t back those out now:
>>
>> https://svnweb.freebsd.org/base?view=revision&revision=268066
>
> It supports \< and \> out of the box, cf.
>
> https://github.com/laurikari/tre/blob/master/doc/tre-syntax.html
>
> And the darwin patch mentioned above implements [[:<:]] and [[:>:]], see
>
> http://www.opensource.apple.com/source/Libc/Libc-1044.40.1/regex/TRE/lib/tre-parse.c
>
> That patch also implements the REG_STARTEND flag for regexec(3), which
> is needed for vi.
>
> Also, tre provides wchar versions for regcomp(3) and friends, so that
> nvi wouldn't need its own private regex library anymore.

Interesting, thanks.

I only looked at it briefly, long ago. I noticed there was a big TODO list
and that the Apple patches were partially copyleft (APSL), so I didn’t dig
into it too much.

It certainly has to be evaluated.

Pedro.
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: [CFT] Unicode collation string and reworked locale definitions
On Tue, Nov 03 2015, Pedro Giffuni wrote:

> What worries me about libtre is that it lacks important functionality like
> word delimiters. We even brought the sysv delimiters to be more compatible
> with Solaris and GNU and we can’t back those out now:
>
> https://svnweb.freebsd.org/base?view=revision&revision=268066

It supports \< and \> out of the box, cf.

https://github.com/laurikari/tre/blob/master/doc/tre-syntax.html

And the darwin patch mentioned above implements [[:<:]] and [[:>:]], see

http://www.opensource.apple.com/source/Libc/Libc-1044.40.1/regex/TRE/lib/tre-parse.c

That patch also implements the REG_STARTEND flag for regexec(3), which
is needed for vi.

Also, tre provides wchar versions for regcomp(3) and friends, so that
nvi wouldn't need its own private regex library anymore.
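
For readers unfamiliar with the two features discussed above, here is a
minimal sketch in C of how a caller might combine the \< and \> word
delimiters with REG_STARTEND. It assumes a regex(3) that accepts \< and \>
in basic regular expressions and defines REG_STARTEND; both are extensions
rather than POSIX, so the code returns -1 where the flag is missing. The
helper name count_whole_words is hypothetical and exists only for this
illustration; it is not part of any of the libraries mentioned.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Count whole-word occurrences of `word` in buf[0..len).
 * Assumption: `word` is short and contains no regex metacharacters.
 * Returns -1 if REG_STARTEND is unavailable or the pattern does not compile.
 */
int count_whole_words(const char *buf, size_t len, const char *word)
{
#ifndef REG_STARTEND
	(void)buf; (void)len; (void)word;
	return -1;
#else
	char pat[128];
	regex_t re;
	regmatch_t m;
	size_t off = 0;
	int n = 0;

	assert(buf != NULL && word != NULL && word[0] != '\0');
	if (strlen(word) + 5 > sizeof(pat))
		return -1;
	/* \< and \> anchor the match to the start and end of a word. */
	snprintf(pat, sizeof(pat), "\\<%s\\>", word);
	if (regcomp(&re, pat, 0) != 0)	/* 0 = basic regular expression */
		return -1;

	while (off < len) {
		/* With REG_STARTEND the caller passes the region in m itself. */
		m.rm_so = (regoff_t)off;
		m.rm_eo = (regoff_t)len;
		if (regexec(&re, buf, 1, &m, REG_STARTEND) != 0)
			break;
		n++;
		/* Offsets are relative to buf, so resume after this match. */
		off = (size_t)m.rm_eo;
	}
	regfree(&re);
	return n;
#endif
}
```

Because the region is given by offsets, the buffer does not have to be
NUL-terminated at `len`, which is presumably why an editor such as vi wants
the flag: it can search inside a larger buffer without copying.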
Re: [CFT] Unicode collation string and reworked locale definitions
Hi Baptiste;

> On 3 Nov 2015, at 02:17, Baptiste Daroussin wrote:
>
> On Mon, Nov 02, 2015 at 06:59:15PM -0500, Pedro Giffuni wrote:
>> First of all, congratulations to Baptiste and Marino for succeeding where
>> I failed many moons ago. Also huge thanks to Nexenta and Garret D’Amore
>> for relicensing localedef for us.
>>
>> Concerning regex:
>>
>> Gabor@ did a lot of work on libtre but according to him it was not up to the
>> task performance-wise. We would also lose features if we move to libtre.
>>
>> I think our regex code actually has most of what is needed for multibyte
>> already. I have a patch that turns on the functionality but I haven’t found
>> any brave soul that will do the testing:
>>
>> https://people.freebsd.org/~pfg/patches/regex-multibyte.diff
>>
> I think this can be tested once the collation branch is merged.

Absolutely: support for collation is critical and badly needed even without
resolving the regex issues.

> Note that dragonfly and musl libc both use a patched version of libtre for
> the regex implementation.

I am aware. Also note that Gabor had some patches too, in order to make it
usable for bsdgrep:

https://wiki.freebsd.org/Regex

> From my non-scientific testing libtre was more reliable and performant than
> our current regex.

According to Gabor, the general performance was better until you take into
account multibyte support, where it was clearly inferior to GNU regex.

> Anyway it will be relatively "easy" to test using the AT&T testsuite the
> reliability and performance of both implementations: ours + your patch and
> patched libtre.

What worries me about libtre is that it lacks important functionality like
word delimiters. We even brought the sysv delimiters to be more compatible
with Solaris and GNU and we can’t back those out now:

https://svnweb.freebsd.org/base?view=revision&revision=268066

Pedro.
Re: [CFT] Unicode collation string and reworked locale definitions
On Mon, Nov 02, 2015 at 06:59:15PM -0500, Pedro Giffuni wrote:
> First of all, congratulations to Baptiste and Marino for succeeding where
> I failed many moons ago. Also huge thanks to Nexenta and Garret D’Amore
> for relicensing localedef for us.
>
> Concerning regex:
>
> Gabor@ did a lot of work on libtre but according to him it was not up to the
> task performance-wise. We would also lose features if we move to libtre.
>
> I think our regex code actually has most of what is needed for multibyte
> already. I have a patch that turns on the functionality but I haven’t found
> any brave soul that will do the testing:
>
> https://people.freebsd.org/~pfg/patches/regex-multibyte.diff
>
I think this can be tested once the collation branch is merged.

Note that dragonfly and musl libc both use a patched version of libtre for
the regex implementation.

From my non-scientific testing libtre was more reliable and performant than
our current regex.

Anyway, it will be relatively "easy" to use the AT&T testsuite to test the
reliability and performance of both implementations: ours + your patch, and
patched libtre.

Best regards,
Bapt
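
To make the comparison idea above concrete, here is a rough sketch in C of a
table-driven check that could be linked against either regex implementation.
It is not the AT&T testsuite and does not read its file format; the struct
regex_case layout and the helpers regex_matches and count_failures are
hypothetical names invented for this illustration. Only the standard
regcomp(3), regexec(3) and regfree(3) calls are used.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* One expectation: does `pattern` (compiled with `cflags`) match `input`? */
struct regex_case {
	const char *pattern;
	int cflags;
	const char *input;
	int expect_match;	/* 1 = match, 0 = no match, -1 = compile error */
};

/* Returns 1 on match, 0 on no match, -1 if the pattern fails to compile. */
int regex_matches(const char *pattern, int cflags, const char *input)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && input != NULL);
	/* REG_NOSUB: only the yes/no answer is needed here. */
	if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, input, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}

/* Run every case and report how many disagree with the expectation. */
size_t count_failures(const struct regex_case *cases, size_t n)
{
	size_t i, failures = 0;

	for (i = 0; i < n; i++) {
		int got = regex_matches(cases[i].pattern, cases[i].cflags,
		    cases[i].input);
		if (got != cases[i].expect_match)
			failures++;
	}
	return failures;
}
```

Building the same table once against the stock libc and once against a
patched libtre would give a first, coarse reliability comparison; timing the
loop would be a separate exercise.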
Re: [CFT] Unicode collation string and reworked locale definitions
First of all, congratulations to Baptiste and Marino for succeeding where
I failed many moons ago. Also huge thanks to Nexenta and Garret D’Amore
for relicensing localedef for us.

Concerning regex:

Gabor@ did a lot of work on libtre but according to him it was not up to the
task performance-wise. We would also lose features if we move to libtre.

I think our regex code actually has most of what is needed for multibyte
already. I have a patch that turns on the functionality but I haven’t found
any brave soul that will do the testing:

https://people.freebsd.org/~pfg/patches/regex-multibyte.diff

Thanks again,

Pedro.
Re: [CFT] Unicode collation string and reworked locale definitions
> On Nov 2, 2015, at 02:17, Baptiste Daroussin wrote:
>
>> On Mon, Nov 02, 2015 at 10:04:11AM +, David Chisnall wrote:
>>> On 1 Nov 2015, at 21:30, Baptiste Daroussin wrote:
>>>
>>> All reported issues have been fixed. Unless more issues are reported,
>>> this will be merged into head next Saturday, November 7th.
>>
>> That’s really excellent news! Thanks for doing this. Are there any good
>> potential sources for the regex stuff? I think std::regex in libc++
>> supports multibyte character sets, but is very full of templates and not
>> very easy to translate into C.
> For the regex tools, it will be another step. I was planning to incorporate
> libtre + Apple's patches like dragonfly did. It would need a lot of tests,
> but from my current testing performance is better than our current
> implementation.
> And it makes libc's regex pass way more entries in the AT&T regex test
> suite.
>
> If anyone else wants to work on bringing that in I would be very glad, as I
> already have too many things on my plate :)

I was about to say... The regex tests on FreeBSD in tools/regression/lib/libc
are quite broken ;(.. (Bug 191354). I'd like to fix/salvage those test cases
if at all possible -- this might be a good motivator for that.

Thanks,
-NGie
Re: [CFT] Unicode collation string and reworked locale definitions
On Mon, Nov 02, 2015 at 10:04:11AM +, David Chisnall wrote:
> On 1 Nov 2015, at 21:30, Baptiste Daroussin wrote:
> >
> > All reported issues have been fixed. Unless more issues are reported,
> > this will be merged into head next Saturday, November 7th.
>
> That’s really excellent news! Thanks for doing this. Are there any good
> potential sources for the regex stuff? I think std::regex in libc++ supports
> multibyte character sets, but is very full of templates and not very easy to
> translate into C.
>
For the regex tools, it will be another step. I was planning to incorporate
libtre + Apple's patches like dragonfly did. It would need a lot of tests,
but from my current testing performance is better than our current
implementation.
And it makes libc's regex pass way more entries in the AT&T regex test
suite.

If anyone else wants to work on bringing that in I would be very glad, as I
already have too many things on my plate :)

Best regards,
Bapt
Re: [CFT] Unicode collation string and reworked locale definitions
On 1 Nov 2015, at 21:30, Baptiste Daroussin wrote:
>
> All reported issues have been fixed. Unless more issues are reported,
> this will be merged into head next Saturday, November 7th.

That’s really excellent news! Thanks for doing this. Are there any good
potential sources for the regex stuff? I think std::regex in libc++ supports
multibyte character sets, but is very full of templates and not very easy to
translate into C.

David
Re: [CFT] Unicode collation string and reworked locale definitions
On 01.11.15 22:30, Baptiste Daroussin wrote:
> On Wed, Oct 14, 2015 at 12:23:06AM +0200, Baptiste Daroussin wrote:
>> Hi all,
>>
>> I have been working for a while on bringing in Unicode string collation
>> support by merging code from Illumos (by Garrett D'Amore, who kindly made
>> sure his work was made available under a BSD license) and Dragonfly (by
>> John Marino), and some ancient work done on FreeBSD by edwin@ but never
>> merged.
>>
>> The result is available in the projects/collation branch.
>>
>> The result of this work:
>> - Locales are now generated with the new localedef(1) tool from CLDR POSIX
>>   files
>> - The generated files are now identified as "BSD 1.0" format
>> - Only "BSD 1.0" locale files are now read, all other versions will be set
>>   to "C"
>> - The localedef(1) tool has been imported from Illumos and modified to use
>>   tree(3) instead of the CDDL avl(3)
>> - A set of tools created by edwin@ and extended by marino@ for dragonfly
>>   has been added to be able to generate locales
>> - Given our regex(3) does not support multibyte yet (actually it does not
>>   support some single-byte codesets) it has been forced to always use
>>   locale C
>> - Remove colldef(1) and mklocale(1)
>> - Finish implementing the numeric BSD extension for ctypes
>> - Add a bunch of new locales: some Arabic locales, Hebrew locales, some
>>   regional locales, etc.
>> - Make a bunch of ISO-8859-1 locales simple aliases of ISO-8859-15 where
>>   it makes sense
>> - Add short versions of locales
>> - Add @euro aliases on the locales where that makes sense
>>
>> Please test the branch and report issues.
>>
>> Note that, yes, this means the COLLATION_FIX patch on glib2 will no longer
>> be necessary, and yes, the icu patch on postgresql will no longer be
>> necessary either.
>>
>> Best regards,
>> Bapt
>
> All reported issues have been fixed. Unless more issues are reported,
> this will be merged into head next Saturday, November 7th.

Cool! Waiting for it!

Thanks,
Andreas
Re: [CFT] Unicode collation string and reworked locale definitions
On Wed, Oct 14, 2015 at 12:23:06AM +0200, Baptiste Daroussin wrote:
> Hi all,
>
> I have been working for a while on bringing in Unicode string collation
> support by merging code from Illumos (by Garrett D'Amore, who kindly made
> sure his work was made available under a BSD license) and Dragonfly (by
> John Marino), and some ancient work done on FreeBSD by edwin@ but never
> merged.
>
> The result is available in the projects/collation branch.
>
> The result of this work:
> - Locales are now generated with the new localedef(1) tool from CLDR POSIX
>   files
> - The generated files are now identified as "BSD 1.0" format
> - Only "BSD 1.0" locale files are now read, all other versions will be set
>   to "C"
> - The localedef(1) tool has been imported from Illumos and modified to use
>   tree(3) instead of the CDDL avl(3)
> - A set of tools created by edwin@ and extended by marino@ for dragonfly
>   has been added to be able to generate locales
> - Given our regex(3) does not support multibyte yet (actually it does not
>   support some single-byte codesets) it has been forced to always use
>   locale C
> - Remove colldef(1) and mklocale(1)
> - Finish implementing the numeric BSD extension for ctypes
> - Add a bunch of new locales: some Arabic locales, Hebrew locales, some
>   regional locales, etc.
> - Make a bunch of ISO-8859-1 locales simple aliases of ISO-8859-15 where
>   it makes sense
> - Add short versions of locales
> - Add @euro aliases on the locales where that makes sense
>
> Please test the branch and report issues.
>
> Note that, yes, this means the COLLATION_FIX patch on glib2 will no longer
> be necessary, and yes, the icu patch on postgresql will no longer be
> necessary either.
>
> Best regards,
> Bapt

All reported issues have been fixed. Unless more issues are reported,
this will be merged into head next Saturday, November 7th.

Bapt
[CFT] Unicode collation string and reworked locale definitions
Hi all,

I have been working for a while on bringing in Unicode string collation
support by merging code from Illumos (by Garrett D'Amore, who kindly made
sure his work was made available under a BSD license) and Dragonfly (by
John Marino), and some ancient work done on FreeBSD by edwin@ but never
merged.

The result is available in the projects/collation branch.

The result of this work:
- Locales are now generated with the new localedef(1) tool from CLDR POSIX
  files
- The generated files are now identified as "BSD 1.0" format
- Only "BSD 1.0" locale files are now read, all other versions will be set
  to "C"
- The localedef(1) tool has been imported from Illumos and modified to use
  tree(3) instead of the CDDL avl(3)
- A set of tools created by edwin@ and extended by marino@ for dragonfly
  has been added to be able to generate locales
- Given our regex(3) does not support multibyte yet (actually it does not
  support some single-byte codesets) it has been forced to always use
  locale C
- Remove colldef(1) and mklocale(1)
- Finish implementing the numeric BSD extension for ctypes
- Add a bunch of new locales: some Arabic locales, Hebrew locales, some
  regional locales, etc.
- Make a bunch of ISO-8859-1 locales simple aliases of ISO-8859-15 where
  it makes sense
- Add short versions of locales
- Add @euro aliases on the locales where that makes sense

Please test the branch and report issues.

Note that, yes, this means the COLLATION_FIX patch on glib2 will no longer
be necessary, and yes, the icu patch on postgresql will no longer be
necessary either.

Best regards,
Bapt
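
For anyone who wants to try the branch, here is a minimal sketch in C of the
kind of caller that collation support affects: sorting strings with
strcoll(3) under a chosen LC_COLLATE locale. setlocale(3), strcoll(3) and
qsort(3) are standard C; the helper name sort_with_locale is hypothetical
and only illustrates the idea. Which locale names exist (for example a
UTF-8 locale) depends on what is installed, so the helper reports failure
rather than assuming one.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort(3) comparator: compare two strings using the current LC_COLLATE. */
static int coll_cmp(const void *a, const void *b)
{
	const char *const *sa = a;
	const char *const *sb = b;

	return strcoll(*sa, *sb);
}

/*
 * Sort `items` in place using the collation rules of `locale_name`.
 * Returns 0 on success, -1 if the locale is not available.
 * Note: setlocale(3) changes process-wide state.
 */
int sort_with_locale(const char **items, size_t n, const char *locale_name)
{
	assert(items != NULL && locale_name != NULL);
	if (setlocale(LC_COLLATE, locale_name) == NULL)
		return -1;
	qsort(items, n, sizeof(items[0]), coll_cmp);
	return 0;
}
```

In the "C" locale strcoll(3) orders strings by byte value, so uppercase
ASCII sorts before lowercase. With a locale generated from the CLDR data,
the same call is expected to follow that locale's collation rules instead;
the exact order depends on the locale definition.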