Re: [HACKERS] Radix tree for character conversion
Hmm, things are bit different. At Thu, 23 Mar 2017 12:13:07 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20170323.121307.241436413.horiguchi.kyot...@lab.ntt.co.jp> > > Ok, I'll write a small script to generate a set of "conversion > > dump" and try to write README.sanity_check describing how to use > > it. > > I found that there's no way to identify the character domain of a > conversion on SQL interface. Unconditionally giving from 0 to > 0x as a bytea string yields too-bloat result by containg > many bogus lines. (If \x40 is a character, convert() also > accepts \x4040, \x404040 and \x40404040) > > One more annoyance is the fact that mappings and conversion > procedures are not in one-to-one correspondence. The > corresnponcence is hidden in conversion_procs/*.c files so we > should extract it from them or provide as knowledge. Both don't > seem good. > > Finally, it seems that I have no choice than resurrecting > map_checker. The exactly the same one no longer works but > map_dumper.c with almost the same structure will work. > > If no one objects to adding map_dumper.c and > gen_mapdumper_header.pl (tentavie name, of course), I'll make a > patch to do that. The scirpt or executable should be compatible between versions but pg_mb_radix_conv is not. On the other hand more upper level API reuiqres server stuff. Finally I made an extension that dumps encoding conversion. encoding_dumper('SJIS', 'UTF-8') or encoding_dumper(35, 6) Then it returns the following output consists of two BYTEAs. srccode | dstcode -+-- \x01| \x01 \x02| \x02 ... \xfc4a | \xe9b899 \xfc4b | \xe9bb91 (7914 rows) This returns in a very short time but doesn't when srccode extends to 4 bytes. As an extreme example the following, > =# select * from encoding_dumper('UTF-8', 'LATIN1'); takes over 2 minutes to return only 255 rows. We cannot determine the exact domain without looking into map data so the function cannot do other than looping through all the four-byte values. Providing a function that gives the domain for a conversion was a mess, especially for artithmetic-conversions. The following query took 94 minutes to give 25M lines/125MB. In short, that's a crap. (the first attached) SELECT x.conname, y.srccode, y.dstcode FROM ( SELECT conname, conforencoding, contoencoding FROM pg_conversion c WHERE pg_char_to_encoding('UTF-8') IN (c.conforencoding, c.contoencoding) AND pg_char_to_encoding('SQL_ASCII') NOT IN (c.conforencoding, c.contoencoding)) as x, LATERAL ( SELECT srccode, dstcode FROM encoding_dumper(x.conforencoding, x.contoencoding)) as y ORDER BY x.conforencoding, x.contoencoding, y.srccode; As the another way, I added a measure to generate plain mapping lists corresponding to .map files (similar to old maps but simpler) and this finishes the work within a second. $ make mapdumps If we will not shortly change the framework of mapped character conversion, the dumper program may be useful but I'm not sure this is reasonable as sanity check for future modifications. In the PoC, pg_mb_radix_tree() is copied into map_checker.c but this needs to be a separate file again. (the second attached) regards, -- Kyotaro Horiguchi NTT Open Source Software Center >From 3763d91d99521f0cfc305bd2199fcb5a263d758d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Mon, 27 Mar 2017 09:46:04 +0900 Subject: [PATCH 1/2] encoding_dumper --- contrib/encoding_dumper/Makefile | 17 +++ contrib/encoding_dumper/encoding_dumper--0.1.sql | 23 contrib/encoding_dumper/encoding_dumper.c| 167 +++ contrib/encoding_dumper/encoding_dumper.control | 5 + 4 files changed, 212 insertions(+) create mode 100755 contrib/encoding_dumper/Makefile create mode 100755 contrib/encoding_dumper/encoding_dumper--0.1.sql create mode 100755 contrib/encoding_dumper/encoding_dumper.c create mode 100755 contrib/encoding_dumper/encoding_dumper.control diff --git a/contrib/encoding_dumper/Makefile b/contrib/encoding_dumper/Makefile new file mode 100755 index 000..54e9cfc --- /dev/null +++ b/contrib/encoding_dumper/Makefile @@ -0,0 +1,17 @@ +# src/backend/util/mb/Unicode/encoding_dumper/Makefile + +MODULES = encoding_dumper +EXTENSION = encoding_dumper +DATA = encoding_dumper--0.1.sql +PGFILEDESC = "encoding_dumper - return encoding dump" + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/encoding_dumper +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/encoding_dumper/encoding_dumper--0.1.sql b/contrib/encoding_dumper/encoding_dumper--0.1.sql new file mode 100755 index 000..4f67946 --- /dev/null +++ b/contrib/encoding_dumper/encoding_dumper--0.1.sql @@ -0,0 +1,23 @@ +/*
Re: [HACKERS] Radix tree for character conversion
At Tue, 21 Mar 2017 13:10:48 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20170321.131048.150321071.horiguchi.kyot...@lab.ntt.co.jp> > At Fri, 17 Mar 2017 13:03:35 +0200, Heikki Linnakangas > wrote in <01efd334-b839-0450-1b63-f2dea9326...@iki.fi> > > On 03/17/2017 07:19 AM, Kyotaro HORIGUCHI wrote: > > > I would like to use convert() function. It can be a large > > > PL/PgSQL function or a series of "SELECT convert(...)"s. The > > > latter is doable on-the-fly (by not generating/storing the whole > > > script). > > > > > > | -- Test for SJIS->UTF-8 conversion > > > | ... > > > | SELECT convert('\', 'SJIS', 'UTF-8'); -- results in error > > > | ... > > > | SELECT convert('\897e', 'SJIS', 'UTF-8'); > > > > Makes sense. > > > > >> You could then run those SQL statements against old and new server > > >> version, and verify that you get the same results. > > > > > > Including the result files in the repository will make this easy > > > but unacceptably bloats. Put mb/Unicode/README.sanity_check? > > > > Yeah, a README with instructions on how to do sounds good. No need to > > include the results in the repository, you can run the script against > > an older version when you need something to compare with. > > Ok, I'll write a small script to generate a set of "conversion > dump" and try to write README.sanity_check describing how to use > it. I found that there's no way to identify the character domain of a conversion on SQL interface. Unconditionally giving from 0 to 0x as a bytea string yields too-bloat result by containg many bogus lines. (If \x40 is a character, convert() also accepts \x4040, \x404040 and \x40404040) One more annoyance is the fact that mappings and conversion procedures are not in one-to-one correspondence. The corresnponcence is hidden in conversion_procs/*.c files so we should extract it from them or provide as knowledge. Both don't seem good. Finally, it seems that I have no choice than resurrecting map_checker. The exactly the same one no longer works but map_dumper.c with almost the same structure will work. If no one objects to adding map_dumper.c and gen_mapdumper_header.pl (tentavie name, of course), I'll make a patch to do that. Any suggestions? regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
Hello, At Fri, 17 Mar 2017 13:03:35 +0200, Heikki Linnakangaswrote in <01efd334-b839-0450-1b63-f2dea9326...@iki.fi> > On 03/17/2017 07:19 AM, Kyotaro HORIGUCHI wrote: > > I would like to use convert() function. It can be a large > > PL/PgSQL function or a series of "SELECT convert(...)"s. The > > latter is doable on-the-fly (by not generating/storing the whole > > script). > > > > | -- Test for SJIS->UTF-8 conversion > > | ... > > | SELECT convert('\', 'SJIS', 'UTF-8'); -- results in error > > | ... > > | SELECT convert('\897e', 'SJIS', 'UTF-8'); > > Makes sense. > > >> You could then run those SQL statements against old and new server > >> version, and verify that you get the same results. > > > > Including the result files in the repository will make this easy > > but unacceptably bloats. Put mb/Unicode/README.sanity_check? > > Yeah, a README with instructions on how to do sounds good. No need to > include the results in the repository, you can run the script against > an older version when you need something to compare with. Ok, I'll write a small script to generate a set of "conversion dump" and try to write README.sanity_check describing how to use it. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
On 03/17/2017 07:19 AM, Kyotaro HORIGUCHI wrote: At Mon, 13 Mar 2017 21:07:39 +0200, Heikki Linnakangaswrote in Hmm. A somewhat different approach might be more suitable for testing across versions, though. We could modify the perl scripts slightly to print out SQL statements that exercise every mapping. For every supported conversion, the SQL script could: 1. create a database in the source encoding. 2. set client_encoding='' 3. SELECT a string that contains every character in the source encoding. There are many encodings that can be client-encoding but cannot be database-encoding. Good point. I would like to use convert() function. It can be a large PL/PgSQL function or a series of "SELECT convert(...)"s. The latter is doable on-the-fly (by not generating/storing the whole script). | -- Test for SJIS->UTF-8 conversion | ... | SELECT convert('\', 'SJIS', 'UTF-8'); -- results in error | ... | SELECT convert('\897e', 'SJIS', 'UTF-8'); Makes sense. You could then run those SQL statements against old and new server version, and verify that you get the same results. Including the result files in the repository will make this easy but unacceptably bloats. Put mb/Unicode/README.sanity_check? Yeah, a README with instructions on how to do sounds good. No need to include the results in the repository, you can run the script against an older version when you need something to compare with. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
Thank you for committing this. At Mon, 13 Mar 2017 21:07:39 +0200, Heikki Linnakangaswrote in > On 03/13/2017 08:53 PM, Tom Lane wrote: > > Heikki Linnakangas writes: > >> It would be nice to run the map_checker tool one more time, though, to > >> verify that the mappings match those from PostgreSQL 9.6. > > > > +1 > > > >> Just to be sure, and after that the map checker can go to the dustbin. > > > > Hm, maybe we should keep it around for the next time somebody has a > > bright > > idea in this area? > > The map checker compares old-style maps with the new radix maps. The > next time 'round, we'll need something that compares the radix maps > with the next great thing. Not sure how easy it would be to adapt. > > Hmm. A somewhat different approach might be more suitable for testing > across versions, though. We could modify the perl scripts slightly to > print out SQL statements that exercise every mapping. For every > supported conversion, the SQL script could: > > 1. create a database in the source encoding. > 2. set client_encoding='' > 3. SELECT a string that contains every character in the source > encoding. There are many encodings that can be client-encoding but cannot be database-encoding. And some encodings such as UTF-8 has several one-way conversion. If we do something like this, it would be as the following. 1. Encoding test 1-1. create a database in UTF-8 1-2. set client_encoding='' 1-3. INSERT all characters defined in the source encoding. 1-4. set client_encoding='UTF-8' 1-5. SELECT a string that contains every character in UTF-8. 2. Decoding test sucks! I would like to use convert() function. It can be a large PL/PgSQL function or a series of "SELECT convert(...)"s. The latter is doable on-the-fly (by not generating/storing the whole script). | -- Test for SJIS->UTF-8 conversion | ... | SELECT convert('\', 'SJIS', 'UTF-8'); -- results in error | ... | SELECT convert('\897e', 'SJIS', 'UTF-8'); > You could then run those SQL statements against old and new server > version, and verify that you get the same results. Including the result files in the repository will make this easy but unacceptably bloats. Put mb/Unicode/README.sanity_check? regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
On Tue, Mar 14, 2017 at 4:07 AM, Heikki Linnakangaswrote: > On 03/13/2017 08:53 PM, Tom Lane wrote: >> Heikki Linnakangas writes: >>> >>> It would be nice to run the map_checker tool one more time, though, to >>> verify that the mappings match those from PostgreSQL 9.6. >> >> +1 Nice to login and see that committed! -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
On 03/13/2017 08:53 PM, Tom Lane wrote: Heikki Linnakangaswrites: It would be nice to run the map_checker tool one more time, though, to verify that the mappings match those from PostgreSQL 9.6. +1 Just to be sure, and after that the map checker can go to the dustbin. Hm, maybe we should keep it around for the next time somebody has a bright idea in this area? The map checker compares old-style maps with the new radix maps. The next time 'round, we'll need something that compares the radix maps with the next great thing. Not sure how easy it would be to adapt. Hmm. A somewhat different approach might be more suitable for testing across versions, though. We could modify the perl scripts slightly to print out SQL statements that exercise every mapping. For every supported conversion, the SQL script could: 1. create a database in the source encoding. 2. set client_encoding='' 3. SELECT a string that contains every character in the source encoding. You could then run those SQL statements against old and new server version, and verify that you get the same results. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
Heikki Linnakangaswrites: > I did some more kibitzing here and there, and committed. Thanks everyone! 111 files changed, 147742 insertions(+), 367346 deletions(-) Nice. > It would be nice to run the map_checker tool one more time, though, to > verify that the mappings match those from PostgreSQL 9.6. +1 > Just to be sure, and after that the map checker can go to the dustbin. Hm, maybe we should keep it around for the next time somebody has a bright idea in this area? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
On 03/06/2017 10:16 AM, Kyotaro HORIGUCHI wrote: At Fri, 3 Mar 2017 12:53:04 +0900, Michael Paquierwrote in On Thu, Mar 2, 2017 at 2:20 PM, Kyotaro HORIGUCHI wrote: 5) Just remove plain map files and all related code. Addition to that, Makefile stores hash digest of authority files in Unicode/authoriy_hashes.txt or something that is managed by git. That may be an idea to check for differences across upstream versions. But that sounds like a separate discussion to me. Fine with me either. I did some more kibitzing here and there, and committed. Thanks everyone! I agree the new maps should just replace the old maps altogether, so committed that way. I also moved the combined map files to the same .map files as the main radix trees. Seems more clear that way to me. I changed the to/from_unicode properties back to a single direction property, with Perl constants BOTH, TO_UNICODE and FROM_UNICODE, per your alternative suggestion upthread. Seems more clear to me. It would be nice to run the map_checker tool one more time, though, to verify that the mappings match those from PostgreSQL 9.6. Just to be sure, and after that the map checker can go to the dustbin. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
Hello, At Fri, 3 Mar 2017 12:53:04 +0900, Michael Paquierwrote in > On Thu, Mar 2, 2017 at 2:20 PM, Kyotaro HORIGUCHI > wrote: > > 5) Just remove plain map files and all related code. Addition to > >that, Makefile stores hash digest of authority files in > >Unicode/authoriy_hashes.txt or something that is managed by > >git. > > That may be an idea to check for differences across upstream versions. > But that sounds like a separate discussion to me. Fine with me either. > > This digest may differ among platforms (typically from cr/nl > > difference) but we can assume *nix for the usage. > > > > I will send the next version after this discussion is settled. > > Sure. There is not much point to move on without Heikki's opinion at > least, or anybody else like Ishii-san or Tom who are familiar with > this code. I would think that Heikki would be the committer to pick up > this change though. So, this is the latest version of this patch in the shape of the option 1. | need some extra opinions is what to do with the old maps: | 1) Just remove them, replacing the old maps by the new radix tree maps. | 2) Keep them around in the backend code, even if they are useless. | 3) Use a GUC to be able to switch from one to the other, giving a | fallback method in case of emergency. | 4) Use an extension module to store the old maps with as well the | previous build code, so as sanity checks can still be performed on the | new maps. regards, -- Kyotaro Horiguchi NTT Open Source Software Center 0001-Use-radix-tree-for-character-conversion_20170306.patch.gz Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
On Thu, Mar 2, 2017 at 2:20 PM, Kyotaro HORIGUCHIwrote: > 5) Just remove plain map files and all related code. Addition to >that, Makefile stores hash digest of authority files in >Unicode/authoriy_hashes.txt or something that is managed by >git. That may be an idea to check for differences across upstream versions. But that sounds like a separate discussion to me. > This digest may differ among platforms (typically from cr/nl > difference) but we can assume *nix for the usage. > > I will send the next version after this discussion is settled. Sure. There is not much point to move on without Heikki's opinion at least, or anybody else like Ishii-san or Tom who are familiar with this code. I would think that Heikki would be the committer to pick up this change though. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
At Wed, 1 Mar 2017 14:34:23 +0900, Michael Paquierwrote in > On Tue, Feb 28, 2017 at 5:34 PM, Kyotaro HORIGUCHI > wrote: > > At Tue, 28 Feb 2017 15:20:06 +0900, Michael Paquier > > wrote in > > > >> +conv.o: conv.c char_converter.c > >> This also can go away. > > > > Touching char_converter.c will be ignored if it is removed. Did > > you mistake it for map_checker? > > That was not what I meant: as pg_mb_radix_conv() is only used in > conv.c, it may be better to just remove completely char_converter.c. Ouch! You're right. Sorry for my short-sight. char_converter.c is removed and related description in Makefile is removed. > > And the code-comment pointed in the comment by the previous mail > > is rewritten as Robert's suggestion. > > Fine for me. > > -distclean: clean > +distclean: > rm -f $(TEXTS) > > -maintainer-clean: distclean > - rm -f $(MAPS) > - > +maintainer-clean: > + rm -f $(TEXTS) $(MAPS) > Well, I would have assumed that this should not change.. I should have reverted there but actually the patch somehow does the different thing.. Surely reverted it this time. > The last version of the patch looks in rather good shape to me, we are > also sure that the sanity checks on the old maps and the new maps > match per the previous runs with map_checker. Agreed. > One thing that still > need some extra opinions is what to do with the old maps: > 1) Just remove them, replacing the old maps by the new radix tree maps. > 2) Keep them around in the backend code, even if they are useless. > 3) Use a GUC to be able to switch from one to the other, giving a > fallback method in case of emergency. > 4) Use an extension module to store the old maps with as well the > previous build code, so as sanity checks can still be performed on the > new maps. > > I would vote for 2), to reduce long term maintenance burdens and after > seeing all the sanity checks that have been done in previous versions. I don't vote 3 and 4. And I did 1 in the last patch. About 2, any change in the authority files rarely but possiblly happens. Even in the case the plain map files are no longer can be generated. (but can with a bit tweak of convutils.pm) If the radix-tree file generator is under a suspicion, the "plain" map file generator (and the map_checker) or some other means to sanity check might be required. That being said, when something occurs in radix files, we can find it in the radix file and can find the corresponding lines in the aurhority file. The remaining problem is the case where some substantial change in authority files doesn't affect radix files. We can detect such mistake by detecting changes in authority files. So I propose the 5th option. 5) Just remove plain map files and all related code. Addition to that, Makefile stores hash digest of authority files in Unicode/authoriy_hashes.txt or something that is managed by git. This digest may differ among platforms (typically from cr/nl difference) but we can assume *nix for the usage. I will send the next version after this discussion is settled. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
On Tue, Feb 28, 2017 at 5:34 PM, Kyotaro HORIGUCHIwrote: > At Tue, 28 Feb 2017 15:20:06 +0900, Michael Paquier > wrote in > >> +conv.o: conv.c char_converter.c >> This also can go away. > > Touching char_converter.c will be ignored if it is removed. Did > you mistake it for map_checker? That was not what I meant: as pg_mb_radix_conv() is only used in conv.c, it may be better to just remove completely char_converter.c. > And the code-comment pointed in the comment by the previous mail > is rewritten as Robert's suggestion. Fine for me. -distclean: clean +distclean: rm -f $(TEXTS) -maintainer-clean: distclean - rm -f $(MAPS) - +maintainer-clean: + rm -f $(TEXTS) $(MAPS) Well, I would have assumed that this should not change.. The last version of the patch looks in rather good shape to me, we are also sure that the sanity checks on the old maps and the new maps match per the previous runs with map_checker. One thing that still need some extra opinions is what to do with the old maps: 1) Just remove them, replacing the old maps by the new radix tree maps. 2) Keep them around in the backend code, even if they are useless. 3) Use a GUC to be able to switch from one to the other, giving a fallback method in case of emergency. 4) Use an extension module to store the old maps with as well the previous build code, so as sanity checks can still be performed on the new maps. I would vote for 2), to reduce long term maintenance burdens and after seeing all the sanity checks that have been done in previous versions. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
At Tue, 28 Feb 2017 15:20:06 +0900, Michael Paquierwrote in > On Mon, Feb 27, 2017 at 5:37 PM, Kyotaro HORIGUCHI > wrote: > > At Wed, 22 Feb 2017 16:06:14 +0900, Michael Paquier > > wrote in > > > >> In order to conduct sanity checks on the shape of the radix tree maps > >> compared to the existing maps, having map_checker surely makes sense. > >> Now in the final result I don't think we need it. The existing map > >> files ought to be replaced by their radix versions at the end, and > >> map_checker should be removed. This leads to a couple of > >> simplifications, like Makefile, and reduces the maintenance to one > >> mechanism. > > > > Hmm.. Though I don't remember clearly what the radix map of the > > first version looked like, the current radix map seems > > human-readable for me. It might be by practice or by additional > > comments in map files. Anyway I removed all of the stuff so as > > not to generate the plain maps. But I didn't change the names of > > _radix.map and just commented out the line to output the plain > > maps in UCS_to_*.pl. Combined maps are still in the plain format > > so print_tables was changed to take character tables separately > > for regular (non-combined) characters and combined characters. > > Do others have thoughts to offer on the matter? I would think that the > new radix maps should just replace by the old plain ones, and that the > only way to build the maps going forward is to use the new methods. > The radix trees is the only thing used in the backend code as well > (conv.c). We could keep the way to build the old maps, with the > map_checker in module out of the core code. FWIW, I am fine to add the > old APIs in my plugin repository on github and have the sanity checks > in that as well. And of course also publish on this thread a module to > do that. I couldn't make out my mind to move to radix tree completely, but UtfToLocal/LocalToUtf no longer handle the "plain map"s for non-combined character so they have lost their planground. Okay, I think I removed all the trace of the plain map era. Every characters in a mapping has a comment that describes what the character is or where it is defined. This information is no longer useful (radix map doesn't have a plance to show it) but I left it for debug use. (This might just be justification..) > > - Split the property {direction} into two boolean properties > > {to_unicode} and {from_unicode}. > > > > - Make the {direction} property an integer and compared with > > defined constants $BOTH, $TO_UNICODE and $FROM_UNICODE using > > the '=' operator. > > > > I choosed the former in this patch. > > Fine for me. Thanks. > >> + $charmap{ ucs2utf($src) } = $dst; > >> + } > >> + > >> + } > >> Unnecessary newline here. > > > > removed in convutils.pm. > > > > Since Makefile ignores old .map files, the steps to generate a > > patch for map files was a bit chaged. > > > > $ rm *.map > > $ make distclean maintainer-clean all > > $ make distclean > > $ git add . > > $ git commit > > +# ignore generated files > +/map_checker > +/map_checker.h > [...] > +map_checker.h: make_mapchecker.pl $(MAPS) $(RADIXMAPS) > + $(PERL) $< > + > +map_checker.o: map_checker.c map_checker.h ../char_converter.c > + > +map_checker: map_checker.o > With map_checker out of the game, those things are not needed. Ouch! Thanks for pointing out it. Removed. > +++ b/src/backend/utils/mb/char_converter.c > @@ -0,0 +1,116 @@ > +/*- > + * > + * Character converter function using radix tree > In the simplified version of the patch, pg_mb_radix_conv() being only > needed in conv.c I think that this could just be a static local > routine. > > -#include "../../Unicode/utf8_to_koi8r.map" > -#include "../../Unicode/koi8r_to_utf8.map" > -#include "../../Unicode/utf8_to_koi8u.map" > -#include "../../Unicode/koi8u_to_utf8.map" > +#include "../../Unicode/utf8_to_koi8r_radix.map" > +#include "../../Unicode/koi8r_to_utf8_radix.map" > +#include "../../Unicode/utf8_to_koi8u_radix.map" > +#include "../../Unicode/koi8u_to_utf8_radix.map" > FWIW, I am fine to use those new names as include points. > > -distclean: clean > +distclean: > rm -f $(TEXTS) > -maintainer-clean: distclean > +# maintainer-clean intentionally leaves $(TEXTS) > +maintainer-clean: > Why is that? There is also a useless diff down that code block. It *was* for convenience but now it is automatically downloaded so such distinction donsn't offer anything good. Changed it to remove $(TEXTS). > +conv.o: conv.c char_converter.c > This also can go away. Touching char_converter.c will be ignored if it is removed. Did you mistake it for map_checker? >
Re: [HACKERS] Radix tree for character conversion
On Mon, Feb 27, 2017 at 5:37 PM, Kyotaro HORIGUCHIwrote: > At Wed, 22 Feb 2017 16:06:14 +0900, Michael Paquier > wrote in > >> In order to conduct sanity checks on the shape of the radix tree maps >> compared to the existing maps, having map_checker surely makes sense. >> Now in the final result I don't think we need it. The existing map >> files ought to be replaced by their radix versions at the end, and >> map_checker should be removed. This leads to a couple of >> simplifications, like Makefile, and reduces the maintenance to one >> mechanism. > > Hmm.. Though I don't remember clearly what the radix map of the > first version looked like, the current radix map seems > human-readable for me. It might be by practice or by additional > comments in map files. Anyway I removed all of the stuff so as > not to generate the plain maps. But I didn't change the names of > _radix.map and just commented out the line to output the plain > maps in UCS_to_*.pl. Combined maps are still in the plain format > so print_tables was changed to take character tables separately > for regular (non-combined) characters and combined characters. Do others have thoughts to offer on the matter? I would think that the new radix maps should just replace by the old plain ones, and that the only way to build the maps going forward is to use the new methods. The radix trees is the only thing used in the backend code as well (conv.c). We could keep the way to build the old maps, with the map_checker in module out of the core code. FWIW, I am fine to add the old APIs in my plugin repository on github and have the sanity checks in that as well. And of course also publish on this thread a module to do that. >> Or if you want to go to the road of non-simple things, you >> could have two arguments: an origin and a target. If one is >> UTF8 the other is the mapping name. > > Mmmm. It seems (even) to me to give more harm than good. I can > guess two alternatives for this. > > - Split the property {direction} into two boolean properties > {to_unicode} and {from_unicode}. > > - Make the {direction} property an integer and compared with > defined constants $BOTH, $TO_UNICODE and $FROM_UNICODE using > the '=' operator. > > I choosed the former in this patch. Fine for me. >> + $charmap{ ucs2utf($src) } = $dst; >> + } >> + >> + } >> Unnecessary newline here. > > removed in convutils.pm. > > Since Makefile ignores old .map files, the steps to generate a > patch for map files was a bit chaged. > > $ rm *.map > $ make distclean maintainer-clean all > $ make distclean > $ git add . > $ git commit +# ignore generated files +/map_checker +/map_checker.h [...] +map_checker.h: make_mapchecker.pl $(MAPS) $(RADIXMAPS) + $(PERL) $< + +map_checker.o: map_checker.c map_checker.h ../char_converter.c + +map_checker: map_checker.o With map_checker out of the game, those things are not needed. +++ b/src/backend/utils/mb/char_converter.c @@ -0,0 +1,116 @@ +/*- + * + * Character converter function using radix tree In the simplified version of the patch, pg_mb_radix_conv() being only needed in conv.c I think that this could just be a static local routine. -#include "../../Unicode/utf8_to_koi8r.map" -#include "../../Unicode/koi8r_to_utf8.map" -#include "../../Unicode/utf8_to_koi8u.map" -#include "../../Unicode/koi8u_to_utf8.map" +#include "../../Unicode/utf8_to_koi8r_radix.map" +#include "../../Unicode/koi8r_to_utf8_radix.map" +#include "../../Unicode/utf8_to_koi8u_radix.map" +#include "../../Unicode/koi8u_to_utf8_radix.map" FWIW, I am fine to use those new names as include points. -distclean: clean +distclean: rm -f $(TEXTS) -maintainer-clean: distclean +# maintainer-clean intentionally leaves $(TEXTS) +maintainer-clean: Why is that? There is also a useless diff down that code block. +conv.o: conv.c char_converter.c This also can go away. -print_tables("EUC_JIS_2004", \@all, 1); +# print_tables("EUC_JIS_2004", \@regular, undef, 1); +print_radix_trees("EUC_JIS_2004", \@regular); +print_tables("EUC_JIS_2004", undef, \@combined, 1); [...] sub print_tables { - my ($charset, $table, $verbose) = @_; + my ($charset, $regular, $combined, $verbose) = @_; print_tables is only used for combined maps, you could remove $regular from it and just keep $combined around, perhaps renaming print_tables to print_combined_maps? -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
Hello, At Tue, 28 Feb 2017 08:00:22 +0530, Robert Haaswrote in > On Mon, Feb 27, 2017 at 2:07 PM, Kyotaro HORIGUCHI > wrote: > >> +# make_charmap - convert charset table to charmap hash > >> +# with checking duplicate source code > >> Maybe this should be "with checking of duplicated source codes". > > > > Even though I'm not good English writer, 'duplicated codes' looks > > as multiple copies of the original 'code' (for me, of > > course). And the 'checking' is a (pure) verbal noun (means not a > > deverbal noun) so 'of' is not required. But, of course, I'm not > > sure which sounds more natural as English > > The problem is that, because "checking" is a noun in this sentence, it > can't be followed by a direct object so you need "of" to connect > "checking" with the thing that is being checked. However, what I > would do is rearrange this sentence slightly as to use "checking" as a > verb, like this: > > convert charset table to charmap hash, checking for duplicate source > codes along the way > > While I don't think Michael's suggestion is wrong, I find the above a > little more natural. Thank you for the suggestion and explantion. I leaned that, maybe. I'll send the version with the revised comment. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
On Mon, Feb 27, 2017 at 2:07 PM, Kyotaro HORIGUCHIwrote: >> +# make_charmap - convert charset table to charmap hash >> +# with checking duplicate source code >> Maybe this should be "with checking of duplicated source codes". > > Even though I'm not good English writer, 'duplicated codes' looks > as multiple copies of the original 'code' (for me, of > course). And the 'checking' is a (pure) verbal noun (means not a > deverbal noun) so 'of' is not required. But, of course, I'm not > sure which sounds more natural as English The problem is that, because "checking" is a noun in this sentence, it can't be followed by a direct object so you need "of" to connect "checking" with the thing that is being checked. However, what I would do is rearrange this sentence slightly as to use "checking" as a verb, like this: convert charset table to charmap hash, checking for duplicate source codes along the way While I don't think Michael's suggestion is wrong, I find the above a little more natural. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
Thank you for the comment. At Wed, 22 Feb 2017 16:06:14 +0900, Michael Paquierwrote in > Thanks for the rebase. I have been spending sore time looking at this > patch. The new stuff in convutils.pm is by far the interesting part of > the patch, where the building of the radix trees using a byte > structure looks in pretty good shape after eyeballing the logic for a > couple of hours. > > +# ignore backup files of editors > +/*[~#] > + > This does not belong to Postgres core code. You could always set up > that in a global exclude file with core.excludefiles. Thank you for letting me know about it. I removed that. > In order to conduct sanity checks on the shape of the radix tree maps > compared to the existing maps, having map_checker surely makes sense. > Now in the final result I don't think we need it. The existing map > files ought to be replaced by their radix versions at the end, and > map_checker should be removed. This leads to a couple of > simplifications, like Makefile, and reduces the maintenance to one > mechanism. Hmm.. Though I don't remember clearly what the radix map of the first version looked like, the current radix map seems human-readable for me. It might be by practice or by additional comments in map files. Anyway I removed all of the stuff so as not to generate the plain maps. But I didn't change the names of _radix.map and just commented out the line to output the plain maps in UCS_to_*.pl. Combined maps are still in the plain format so print_tables was changed to take character tables separately for regular (non-combined) characters and combined characters. > +sub print_radix_trees > +{ > + my ($this_script, $csname, $charset) = @_; > + > + _radix_map($this_script, $csname, "from_unicode", $charset, 78); > + _radix_map($this_script, $csname, "to_unicode", $charset, 78); > +} > There is no need for the table width to be defined as a variable (5th > argument). The table width was already useless.. Removed. > Similarly, to_unicode/from_unicode require checks in > multiple places, this could be just a simple boolean flag. The direction is a tristate (to/from/both) variable so cannot be replaced with a boolean. But I agree that comparing with free string is not so good. This is a change already committed in the master but it is changed in the attached patch. # Perhaps it is easier to read in string form.. > Or if you want to go to the road of non-simple things, you > could have two arguments: an origin and a target. If one is > UTF8 the other is the mapping name. Mmmm. It seems (even) to me to give more harm than good. I can guess two alternatives for this. - Split the property {direction} into two boolean properties {to_unicode} and {from_unicode}. - Make the {direction} property an integer and compared with defined constants $BOTH, $TO_UNICODE and $FROM_UNICODE using the '=' operator. I choosed the former in this patch. > +sub dump_charset > +{ > + my ($list, $filt) = @_; > + > + foreach my $i (@$list) > + { > + next if (defined $filt && !&$filt($i)); > + if (!defined $i->{ucs}) { $i->{ucs} = ($i->{utf8}); } > + printf "ucs=%x, code=%x, direction=%s %s:%d %s\n", > + $i->{ucs}, $i->{code}, $i->{direction}, > + $i->{f}, $i->{l},$i->{comment}; > + } > +} > This is used nowhere. Perhaps it was useful for debugging at some point? Yes, it was quite useful. Removed. > +# make_charmap - convert charset table to charmap hash > +# with checking duplicate source code > Maybe this should be "with checking of duplicated source codes". Even though I'm not good English writer, 'duplicated codes' looks as multiple copies of the original 'code' (for me, of course). And the 'checking' is a (pure) verbal noun (means not a deverbal noun) so 'of' is not required. But, of course, I'm not sure which sounds more natural as English. This comment is not changed. > +# print_radix_map($this_script, $csname, $direction, \%charset, $tblwidth) > +# > +# this_script - the name of the *caller script* of this feature > $this_script is not needed at all, you could just use basename($0) and > reduce the number of arguments of the different functions of the > stack. I avoided relying on global stuff by that but I accept the suggestion. Fixed in this version. > + ### amount of zeros that can be ovarlaid. > s/ovarlaid/overlaid. > > +# make_mapchecker.pl - Gerates map_checker.h file included by map_checker.c > s/gerates/generates/ make_mapcheker.pl is a tool only for map_checker.c so this files is removed. > + if (s < 0x80) > + { > + fprintf(stderr, "\nASCII character ? (%x)", s); > + exit(1); > + } > Most likely a newline at the end of the error string is better here. map_checker.c is removed. > + $charmap{ ucs2utf($src) } = $dst; > + } > + > + } >
Re: [HACKERS] Radix tree for character conversion
On Fri, Feb 3, 2017 at 1:18 PM, Kyotaro HORIGUCHIwrote: > Thanks to that Heikki have pushed the first two patches and a > part of the third, only one patch is remaining now. > > # Sorry for not separating KOI8 stuffs. > > At Tue, 31 Jan 2017 19:06:09 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI > wrote in > <20170131.190609.254672218.horiguchi.kyot...@lab.ntt.co.jp> >> > Thanks for the new version, I'll look at it once I am done with the >> > cleanup of the current CF. For now I have moved it to the CF 2017-03. >> >> Agreed. Thank you. > > Attached is the latest version on the current master (555494d). > > Note: since this patch is created by git diff --irreversble-delete, > three files mb/Unicode/*.(txt|xml) to be deleted are left alone. Thanks for the rebase. I have been spending sore time looking at this patch. The new stuff in convutils.pm is by far the interesting part of the patch, where the building of the radix trees using a byte structure looks in pretty good shape after eyeballing the logic for a couple of hours. +# ignore backup files of editors +/*[~#] + This does not belong to Postgres core code. You could always set up that in a global exclude file with core.excludefiles. In order to conduct sanity checks on the shape of the radix tree maps compared to the existing maps, having map_checker surely makes sense. Now in the final result I don't think we need it. The existing map files ought to be replaced by their radix versions at the end, and map_checker should be removed. This leads to a couple of simplifications, like Makefile, and reduces the maintenance to one mechanism. +sub print_radix_trees +{ + my ($this_script, $csname, $charset) = @_; + + _radix_map($this_script, $csname, "from_unicode", $charset, 78); + _radix_map($this_script, $csname, "to_unicode", $charset, 78); +} There is no need for the table width to be defined as a variable (5th argument). Similarly, to_unicode/from_unicode require checks in multiple places, this could be just a simple boolean flag. Or if you want to go to the road of non-simple things, you could have two arguments: an origin and a target. If one is UTF8 the other is the mapping name. +sub dump_charset +{ + my ($list, $filt) = @_; + + foreach my $i (@$list) + { + next if (defined $filt && !&$filt($i)); + if (!defined $i->{ucs}) { $i->{ucs} = ($i->{utf8}); } + printf "ucs=%x, code=%x, direction=%s %s:%d %s\n", + $i->{ucs}, $i->{code}, $i->{direction}, + $i->{f}, $i->{l},$i->{comment}; + } +} This is used nowhere. Perhaps it was useful for debugging at some point? +# make_charmap - convert charset table to charmap hash +# with checking duplicate source code Maybe this should be "with checking of duplicated source codes". +# print_radix_map($this_script, $csname, $direction, \%charset, $tblwidth) +# +# this_script - the name of the *caller script* of this feature $this_script is not needed at all, you could just use basename($0) and reduce the number of arguments of the different functions of the stack. + ### amount of zeros that can be ovarlaid. s/ovarlaid/overlaid. +# make_mapchecker.pl - Gerates map_checker.h file included by map_checker.c s/gerates/generates/ + if (s < 0x80) + { + fprintf(stderr, "\nASCII character ? (%x)", s); + exit(1); + } Most likely a newline at the end of the error string is better here. + $charmap{ ucs2utf($src) } = $dst; + } + + } Unnecessary newline here. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
Tnanks to that Heikki have pushed the first two patches and a part of the third, only one patch is remaining now. # Sorry for not separating KOI8 stuffs. At Tue, 31 Jan 2017 19:06:09 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20170131.190609.254672218.horiguchi.kyot...@lab.ntt.co.jp> > > Thanks for the new version, I'll look at it once I am done with the > > cleanup of the current CF. For now I have moved it to the CF 2017-03. > > Agreed. Thank you. Attached is the latest version on the current master (555494d). Note: since this patch is created by git diff --irreversble-delete, three files mb/Unicode/*.(txt|xml) to be deleted are left alone. regards, -- Kyotaro Horiguchi NTT Open Source Software Center >From 68d75100b7e8aaab7706ea780a1e23557c676c87 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Tue, 10 Jan 2017 20:02:00 +0900 Subject: [PATCH] Use radix tree for character conversion This patch adds multibyte character converter based using radix tree based on Heikki's rework of my previous patch. --- src/backend/utils/mb/Makefile | 2 + src/backend/utils/mb/Unicode/.gitignore|11 + src/backend/utils/mb/Unicode/Makefile |72 +- src/backend/utils/mb/Unicode/UCS_to_BIG5.pl| 9 +- src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl | 9 +- .../utils/mb/Unicode/UCS_to_EUC_JIS_2004.pl|19 +- src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl | 6 +- src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl |13 +- src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl | 9 +- src/backend/utils/mb/Unicode/UCS_to_GB18030.pl | 9 +- src/backend/utils/mb/Unicode/UCS_to_JOHAB.pl |11 +- .../utils/mb/Unicode/UCS_to_SHIFT_JIS_2004.pl |14 +- src/backend/utils/mb/Unicode/UCS_to_SJIS.pl|29 +- src/backend/utils/mb/Unicode/UCS_to_UHC.pl |11 +- src/backend/utils/mb/Unicode/UCS_to_most.pl| 5 +- src/backend/utils/mb/Unicode/convutils.pm | 679 +- src/backend/utils/mb/Unicode/euc-jis-2004-std.txt | 11549 --- src/backend/utils/mb/Unicode/gb-18030-2000.xml | 30916 --- src/backend/utils/mb/Unicode/make_mapchecker.pl|78 + src/backend/utils/mb/Unicode/map_checker.c |94 + .../utils/mb/Unicode/sjis-0213-2004-std.txt| 11549 --- src/backend/utils/mb/char_converter.c | 116 + src/backend/utils/mb/conv.c| 137 +- .../conversion_procs/utf8_and_big5/utf8_and_big5.c | 8 +- .../utf8_and_cyrillic/utf8_and_cyrillic.c |16 +- .../utf8_and_euc2004/utf8_and_euc2004.c| 8 +- .../utf8_and_euc_cn/utf8_and_euc_cn.c | 8 +- .../utf8_and_euc_jp/utf8_and_euc_jp.c | 8 +- .../utf8_and_euc_kr/utf8_and_euc_kr.c | 8 +- .../utf8_and_euc_tw/utf8_and_euc_tw.c | 8 +- .../utf8_and_gb18030/utf8_and_gb18030.c| 8 +- .../conversion_procs/utf8_and_gbk/utf8_and_gbk.c | 8 +- .../utf8_and_iso8859/utf8_and_iso8859.c| 127 +- .../utf8_and_johab/utf8_and_johab.c| 8 +- .../conversion_procs/utf8_and_sjis/utf8_and_sjis.c | 8 +- .../utf8_and_sjis2004/utf8_and_sjis2004.c | 8 +- .../conversion_procs/utf8_and_uhc/utf8_and_uhc.c | 8 +- .../conversion_procs/utf8_and_win/utf8_and_win.c |98 +- src/include/mb/pg_wchar.h |56 +- 39 files changed, 1355 insertions(+), 54385 deletions(-) create mode 100644 src/backend/utils/mb/Unicode/.gitignore delete mode 100644 src/backend/utils/mb/Unicode/euc-jis-2004-std.txt delete mode 100644 src/backend/utils/mb/Unicode/gb-18030-2000.xml create mode 100755 src/backend/utils/mb/Unicode/make_mapchecker.pl create mode 100644 src/backend/utils/mb/Unicode/map_checker.c delete mode 100644 src/backend/utils/mb/Unicode/sjis-0213-2004-std.txt create mode 100644 src/backend/utils/mb/char_converter.c diff --git a/src/backend/utils/mb/Makefile b/src/backend/utils/mb/Makefile index 89bec21..d48e729 100644 --- a/src/backend/utils/mb/Makefile +++ b/src/backend/utils/mb/Makefile @@ -14,6 +14,8 @@ include $(top_builddir)/src/Makefile.global OBJS = encnames.o conv.o mbutils.o wchar.o wstrcmp.o wstrncmp.o +conv.o: conv.c char_converter.c + include $(top_srcdir)/src/backend/common.mk clean distclean maintainer-clean: diff --git a/src/backend/utils/mb/Unicode/.gitignore b/src/backend/utils/mb/Unicode/.gitignore new file mode 100644 index 000..3908cc3 --- /dev/null +++ b/src/backend/utils/mb/Unicode/.gitignore @@ -0,0 +1,11 @@ +# ignore backup files of editors +/*[~#] + +# ignore authority files +/*.TXT +/*.txt +/*.xml + +# ignore generated files +/map_checker +/map_checker.h diff --git a/src/backend/utils/mb/Unicode/Makefile b/src/backend/utils/mb/Unicode/Makefile
Re: [HACKERS] Radix tree for character conversion
At Tue, 31 Jan 2017 12:25:46 +0900, Michael Paquierwrote in > On Mon, Jan 30, 2017 at 3:37 PM, Kyotaro HORIGUCHI > wrote: > > Hello, this is the revised version of character conversion using radix tree. > > Thanks for the new version, I'll look at it once I am done with the > cleanup of the current CF. For now I have moved it to the CF 2017-03. Agreed. Thank you. -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
On Mon, Jan 30, 2017 at 3:37 PM, Kyotaro HORIGUCHIwrote: > Hello, this is the revised version of character conversion using radix tree. Thanks for the new version, I'll look at it once I am done with the cleanup of the current CF. For now I have moved it to the CF 2017-03. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
Hello, this is the revised version of character conversion using radix tree. At Fri, 27 Jan 2017 17:33:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20170127.173357.221584433.horiguchi.kyot...@lab.ntt.co.jp> > Hi, this is an intermediate report without a patch. > > At Thu, 26 Jan 2017 21:42:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI > wrote in > <20170126.214212.111556326.horiguchi.kyot...@lab.ntt.co.jp> > > > > 0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch > > > > > > > > Before adding radix tree stuff, applied pgperltidy and inserted > > > > format-skipping pragma for the parts where perltidy seems to do > > > > too much. > > > > > > Which version of perltidy did you use? Looking at the archives, the > > > perl code is cleaned up with a specific version, v20090616. See > > > https://www.postgresql.org/message-id/20151204054322.ga2070...@tornado.leadboat.com > > > for example on the matter. As perltidy changes over time, this may be > > > a sensitive change if done this way. > > My perltidy -v said "v20121207'. Anyway, I gave up to apply > perltidy by myself. So I'll just drop 0003 and new 0004 (name > changed to 0003) is made immediately on 0002. I'm not sure what to handle this so I just removed the perltidy stuff from this patchset. > > > > 0004-Use-radix-tree-for-character-conversion.patch > > > > > > > > Radix tree body. > > > > > > Well, here a lot of diffs could have been saved. > > > > > > > The unattached fifth patch is generated by the following steps. > > > > > > > > [$(TOP)]$ ./configure > > > > [Unicode]$ make > > > > [Unicode]$ make distclean > > > > [Unicode]$ git add . > > > > [Unicode]$ commit > > > > === COMMITE MESSSAGE > > > > Replace map files with radix tree files. > > > > > > > > These encodings no longer uses the former map files and uses new radix > > > > tree files. > > > > === > > > > > > OK, I can see that working, with 200k of maps generated.. So going > > > through the important bits of this jungle.. > > > > Many thaks for the exploration. > > > > > +/* > > > + * radix tree conversion function - this should be identical to the > > > function in > > > + * ../conv.c with the same name > .. > > > This is not nice. Having a duplication like that is a recipe to forget > > > about it as this patch introduces a dependency with conv.c and the > > > radix tree generation. > > In the attatched patch, mb/char_conveter.c which contains one > inline function is created and it is includ'ed from mb/conv.c and > mb/Unicode/map_checker.c. > > > > Having a .gitignore in Unicode/ would be nice, particularly to avoid > > > committing map_checker. > > I missed this. I added .gitignore to ignore map_checker stuff > and authority files and old-style map files. > > > > A README documenting things may be welcome, or at least comments at > > > the top of map_checker.c. Why is map_checker essential? What does it > > > do? There is no way to understand that easily, except that it includes > > > a "radix tree conversion function", and that it performs sanity checks > > > on the radix trees to be sure that they are on a good shape. But as > > > this something that one would guess only after looking at your patch > > > and the code (at least I will sleep less stupid tonight after reading > > > this stuff). > > > > Okay, I'll do that. > > The patch has not been provided yet, I'm going to put the > following comment just before the main() in map_checker.c. > > /* > * The old-style plain map files were error-resistant due to its > * straight-forward way for generation from authority files. In contrast the > * radix tree maps are generated by a rather complex calculation and have a > * complex, hard-to-confirm format. > * > * This program runs sanity check of the radix tree maps by confirming all > * characters in the plain map files to be converted to the same code by the > * corresponding radix tree map. > * > * All map files are included by map_checker.h that is generated by the script > * make_mapchecker.pl as the variable mappairs. > * > */ > > > I'll do the following things later. The following is the continuation. > > --- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl > > +++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl > > # Drop these SJIS codes from the source for UTF8=>SJIS conversion > > #<<< do not let perltidy touch this > > -my @reject_sjis =( > > +my @reject_sjis = ( > > 0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782, > > - 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797, > > + 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797, > > 0x879a..0x879c > > -); > > + ); > > This is not generated, it would be nice to drop the noise from the patch. > > Mmm. I'm not sure how this is generated but I'll care for that. I don't still understand what what the intermediate diff comes from but copy-n-pasting from master silenced
Re: [HACKERS] Radix tree for character conversion
Hi, this is an intermediate report without a patch. At Thu, 26 Jan 2017 21:42:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20170126.214212.111556326.horiguchi.kyot...@lab.ntt.co.jp> > > > 0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch > > > > > > Before adding radix tree stuff, applied pgperltidy and inserted > > > format-skipping pragma for the parts where perltidy seems to do > > > too much. > > > > Which version of perltidy did you use? Looking at the archives, the > > perl code is cleaned up with a specific version, v20090616. See > > https://www.postgresql.org/message-id/20151204054322.ga2070...@tornado.leadboat.com > > for example on the matter. As perltidy changes over time, this may be > > a sensitive change if done this way. > > Hmm. I will make a confirmation on that.. tomorrow. My perltidy -v said "v20121207'. Anyway, I gave up to apply perltidy by myself. So I'll just drop 0003 and new 0004 (name changed to 0003) is made immediately on 0002. > > > 0004-Use-radix-tree-for-character-conversion.patch > > > > > > Radix tree body. > > > > Well, here a lot of diffs could have been saved. > > > > > The unattached fifth patch is generated by the following steps. > > > > > > [$(TOP)]$ ./configure > > > [Unicode]$ make > > > [Unicode]$ make distclean > > > [Unicode]$ git add . > > > [Unicode]$ commit > > > === COMMITE MESSSAGE > > > Replace map files with radix tree files. > > > > > > These encodings no longer uses the former map files and uses new radix > > > tree files. > > > === > > > > OK, I can see that working, with 200k of maps generated.. So going > > through the important bits of this jungle.. > > Many thaks for the exploration. > > > +/* > > + * radix tree conversion function - this should be identical to the > > function in > > + * ../conv.c with the same name .. > > This is not nice. Having a duplication like that is a recipe to forget > > about it as this patch introduces a dependency with conv.c and the > > radix tree generation. In the attatched patch, mb/char_conveter.c which contains one inline function is created and it is includ'ed from mb/conv.c and mb/Unicode/map_checker.c. > > Having a .gitignore in Unicode/ would be nice, particularly to avoid > > committing map_checker. I missed this. I added .gitignore to ignore map_checker stuff and authority files and old-style map files. > > A README documenting things may be welcome, or at least comments at > > the top of map_checker.c. Why is map_checker essential? What does it > > do? There is no way to understand that easily, except that it includes > > a "radix tree conversion function", and that it performs sanity checks > > on the radix trees to be sure that they are on a good shape. But as > > this something that one would guess only after looking at your patch > > and the code (at least I will sleep less stupid tonight after reading > > this stuff). > > Okay, I'll do that. The patch has not been provided yet, I'm going to put the following comment just before the main() in map_checker.c. /* * The old-style plain map files were error-resistant due to its * straight-forward way for generation from authority files. In contrast the * radix tree maps are generated by a rather complex calculation and have a * complex, hard-to-confirm format. * * This program runs sanity check of the radix tree maps by confirming all * characters in the plain map files to be converted to the same code by the * corresponding radix tree map. * * All map files are included by map_checker.h that is generated by the script * make_mapchecker.pl as the variable mappairs. * */ I'll do the following things later. > > --- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl > > +++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl > > # Drop these SJIS codes from the source for UTF8=>SJIS conversion > > #<<< do not let perltidy touch this > > -my @reject_sjis =( > > +my @reject_sjis = ( > > 0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782, > > - 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797, > > + 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797, > > 0x879a..0x879c > > -); > > + ); > > This is not generated, it would be nice to drop the noise from the patch. > > Mmm. I'm not sure how this is generated but I'll care for that. > > > Here is another one: > > - $i->{code} = $jis | ( > > - $jis < 0x100 > > - ? 0x8e00 > > - : ($sjis >= 0xeffd ? 0x8f8080 : 0x8080)); > > - > > +#<<< do not let perltidy touch this > > + $i->{code} = $jis | ($jis < 0x100 ? 0x8e00: > > +($sjis >= 0xeffd ? 0x8f8080 : 0x8080)); > > +#>>> > > Ok. Will revert this. > > > if (l == 2) > > { > > - iutf = *utf++ << 8; > > - iutf |= *utf++; > > + b3 = *utf++; > > + b4 = *utf++; > > } > > Ah, OK. This conversion is important so as it performs a
Re: [HACKERS] Radix tree for character conversion
Thank you for looking this. At Thu, 26 Jan 2017 16:28:16 +0900, Michael Paquierwrote in > On Tue, Jan 10, 2017 at 8:22 PM, Kyotaro HORIGUCHI > wrote: > > [...patch...] > > Nobody has showed up yet to review this patch, so I am giving it a shot. > > The patch file sizes are scary at first sight, but after having a look: > 36 files changed, 1411 insertions(+), 54398 deletions(-) > Yes that's a surprise, something like git diff --irreversible-delete > would have helped as most of the diffs are just caused by 3 files > being deleted in patch 0004, making 50k lines going to the abyss of > deletion. Thank you. Good to hear that. I'll try that at the next chance. > > Hello, I found a bug in my portion while rebasing. > > Right, that's 0001. Nice catch. > > > The attached files are the following. This patchset is not > > complete missing changes of map files. The change is tremendously > > large but generatable. > > > > 0001-Add-missing-semicolon.patch > > > > UCS_to_EUC_JP.pl has a line missing teminating semicolon. This > > doesn't harm but surely a syntax error. This patch fixes it. > > This might should be a separate patch. > > This requires a back-patch. This makes me wonder how long this script > has actually not run... > > > 0002-Correct-reference-resolution-syntax.patch > > > > convutils.pm has lines with different syntax of reference > > resolution. This unifies the syntax. > > Yes that looks right to me. Yes, I thoght that the three patches can be back-patched, a kind of bug fix. > I am the best perl guru on this list but > looking around $$var{foo} is bad, ${$var}{foo} is better, and > $var->{foo} is even better. This also generates no diffs when running > make in src/backend/utils/mb/Unicode/. So no objections to that. Thank you for the explanation. I think no '$$'s is left alone. > > 0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch > > > > Before adding radix tree stuff, applied pgperltidy and inserted > > format-skipping pragma for the parts where perltidy seems to do > > too much. > > Which version of perltidy did you use? Looking at the archives, the > perl code is cleaned up with a specific version, v20090616. See > https://www.postgresql.org/message-id/20151204054322.ga2070...@tornado.leadboat.com > for example on the matter. As perltidy changes over time, this may be > a sensitive change if done this way. Hmm. I will make a confirmation on that.. tomorrow. > > 0004-Use-radix-tree-for-character-conversion.patch > > > > Radix tree body. > > Well, here a lot of diffs could have been saved. > > > The unattached fifth patch is generated by the following steps. > > > > [$(TOP)]$ ./configure > > [Unicode]$ make > > [Unicode]$ make distclean > > [Unicode]$ git add . > > [Unicode]$ commit > > === COMMITE MESSSAGE > > Replace map files with radix tree files. > > > > These encodings no longer uses the former map files and uses new radix > > tree files. > > === > > OK, I can see that working, with 200k of maps generated.. So going > through the important bits of this jungle.. Many thaks for the exploration. > +/* > + * radix tree conversion function - this should be identical to the function > in > + * ../conv.c with the same name > + */ > +static inline uint32 > +pg_mb_radix_conv(const pg_mb_radix_tree *rt, > +int l, > +unsigned char b1, > +unsigned char b2, > +unsigned char b3, > +unsigned char b4) > This is not nice. Having a duplication like that is a recipe to forget > about it as this patch introduces a dependency with conv.c and the > radix tree generation. M. I agree to you, but conv.c contains unwanted reference to elog or other sutff of the core. Separating the function in a dedicate source file named such as "../pg_mb_radix_conv.c" will work. If it is not so bad, I'll do that in the next version. > Having a .gitignore in Unicode/ would be nice, particularly to avoid > committing map_checker. > > A README documenting things may be welcome, or at least comments at > the top of map_checker.c. Why is map_checker essential? What does it > do? There is no way to understand that easily, except that it includes > a "radix tree conversion function", and that it performs sanity checks > on the radix trees to be sure that they are on a good shape. But as > this something that one would guess only after looking at your patch > and the code (at least I will sleep less stupid tonight after reading > this stuff). Okay, I'll do that. > --- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl > +++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl > # Drop these SJIS codes from the source for UTF8=>SJIS conversion > #<<< do not let perltidy touch this > -my @reject_sjis =( > +my @reject_sjis = ( > 0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782, >
Re: [HACKERS] Radix tree for character conversion
At Thu, 26 Jan 2017 16:29:10 +0900, Michael Paquierwrote in > On Wed, Jan 25, 2017 at 7:18 PM, Ishii Ayumi wrote: > > I patched 4 patchset and run "make", but I got failed. > > Is this a bug or my mistake ? > > I'm sorry if I'm wrong. > > > > [$(TOP)]$ patch -p1 < ../0001-Add-missing-semicolon.patch > > [$(TOP)]$ patch -p1 < ../0002-Correct-reference-resolution-syntax.patch > > [$(TOP)]$ patch -p1 < > > ../0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch > > [$(TOP)]$ patch -p1 < ../0004-Use-radix-tree-for-character-conversion.patch > > [$(TOP)]$ ./configure > > [Unicode]$ make > > '/usr/bin/perl' UCS_to_most.pl > > Type of arg 1 to keys must be hash (not hash element) at convutils.pm > > line 443, near "}) > > " > > Type of arg 1 to values must be hash (not hash element) at > > convutils.pm line 596, near "}) > > " > > Type of arg 1 to each must be hash (not private variable) at > > convutils.pm line 755, near "$map) > > " > > Compilation failed in require at UCS_to_most.pl line 19. > > make: *** [iso8859_2_to_utf8.map] Error 255 > > Hm, I am not sure what you are missing. I was able to get things to build. As I posted, it should be caused by older perl, at least 5.8 complains so and 5.16 doesn't. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
On Wed, Jan 25, 2017 at 7:18 PM, Ishii Ayumiwrote: > I patched 4 patchset and run "make", but I got failed. > Is this a bug or my mistake ? > I'm sorry if I'm wrong. > > [$(TOP)]$ patch -p1 < ../0001-Add-missing-semicolon.patch > [$(TOP)]$ patch -p1 < ../0002-Correct-reference-resolution-syntax.patch > [$(TOP)]$ patch -p1 < > ../0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch > [$(TOP)]$ patch -p1 < ../0004-Use-radix-tree-for-character-conversion.patch > [$(TOP)]$ ./configure > [Unicode]$ make > '/usr/bin/perl' UCS_to_most.pl > Type of arg 1 to keys must be hash (not hash element) at convutils.pm > line 443, near "}) > " > Type of arg 1 to values must be hash (not hash element) at > convutils.pm line 596, near "}) > " > Type of arg 1 to each must be hash (not private variable) at > convutils.pm line 755, near "$map) > " > Compilation failed in require at UCS_to_most.pl line 19. > make: *** [iso8859_2_to_utf8.map] Error 255 Hm, I am not sure what you are missing. I was able to get things to build. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
On Tue, Jan 10, 2017 at 8:22 PM, Kyotaro HORIGUCHIwrote: > [...patch...] Nobody has showed up yet to review this patch, so I am giving it a shot. The patch file sizes are scary at first sight, but after having a look: 36 files changed, 1411 insertions(+), 54398 deletions(-) Yes that's a surprise, something like git diff --irreversible-delete would have helped as most of the diffs are just caused by 3 files being deleted in patch 0004, making 50k lines going to the abyss of deletion. > Hello, I found a bug in my portion while rebasing. Right, that's 0001. Nice catch. > The attached files are the following. This patchset is not > complete missing changes of map files. The change is tremendously > large but generatable. > > 0001-Add-missing-semicolon.patch > > UCS_to_EUC_JP.pl has a line missing teminating semicolon. This > doesn't harm but surely a syntax error. This patch fixes it. > This might should be a separate patch. This requires a back-patch. This makes me wonder how long this script has actually not run... > 0002-Correct-reference-resolution-syntax.patch > > convutils.pm has lines with different syntax of reference > resolution. This unifies the syntax. Yes that looks right to me. I am the best perl guru on this list but looking around $$var{foo} is bad, ${$var}{foo} is better, and $var->{foo} is even better. This also generates no diffs when running make in src/backend/utils/mb/Unicode/. So no objections to that. > 0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch > > Before adding radix tree stuff, applied pgperltidy and inserted > format-skipping pragma for the parts where perltidy seems to do > too much. Which version of perltidy did you use? Looking at the archives, the perl code is cleaned up with a specific version, v20090616. See https://www.postgresql.org/message-id/20151204054322.ga2070...@tornado.leadboat.com for example on the matter. As perltidy changes over time, this may be a sensitive change if done this way. > 0004-Use-radix-tree-for-character-conversion.patch > > Radix tree body. Well, here a lot of diffs could have been saved. > The unattached fifth patch is generated by the following steps. > > [$(TOP)]$ ./configure > [Unicode]$ make > [Unicode]$ make distclean > [Unicode]$ git add . > [Unicode]$ commit > === COMMITE MESSSAGE > Replace map files with radix tree files. > > These encodings no longer uses the former map files and uses new radix > tree files. > === OK, I can see that working, with 200k of maps generated.. So going through the important bits of this jungle.. +/* + * radix tree conversion function - this should be identical to the function in + * ../conv.c with the same name + */ +static inline uint32 +pg_mb_radix_conv(const pg_mb_radix_tree *rt, +int l, +unsigned char b1, +unsigned char b2, +unsigned char b3, +unsigned char b4) This is not nice. Having a duplication like that is a recipe to forget about it as this patch introduces a dependency with conv.c and the radix tree generation. Having a .gitignore in Unicode/ would be nice, particularly to avoid committing map_checker. A README documenting things may be welcome, or at least comments at the top of map_checker.c. Why is map_checker essential? What does it do? There is no way to understand that easily, except that it includes a "radix tree conversion function", and that it performs sanity checks on the radix trees to be sure that they are on a good shape. But as this something that one would guess only after looking at your patch and the code (at least I will sleep less stupid tonight after reading this stuff). --- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl +++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl # Drop these SJIS codes from the source for UTF8=>SJIS conversion #<<< do not let perltidy touch this -my @reject_sjis =( +my @reject_sjis = ( 0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782, - 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797, + 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797, 0x879a..0x879c -); + ); This is not generated, it would be nice to drop the noise from the patch. Here is another one: - $i->{code} = $jis | ( - $jis < 0x100 - ? 0x8e00 - : ($sjis >= 0xeffd ? 0x8f8080 : 0x8080)); - +#<<< do not let perltidy touch this + $i->{code} = $jis | ($jis < 0x100 ? 0x8e00: +($sjis >= 0xeffd ? 0x8f8080 : 0x8080)); +#>>> if (l == 2) { - iutf = *utf++ << 8; - iutf |= *utf++; + b3 = *utf++; + b4 = *utf++; } Ah, OK. This conversion is important so as it performs a minimum of bitwise operations. Yes let's keep that. That's pretty cool to get a faster operation. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your
Re: [HACKERS] Radix tree for character conversion
HI, I patched 4 patchset and run "make", but I got failed. Is this a bug or my mistake ? I'm sorry if I'm wrong. [$(TOP)]$ patch -p1 < ../0001-Add-missing-semicolon.patch [$(TOP)]$ patch -p1 < ../0002-Correct-reference-resolution-syntax.patch [$(TOP)]$ patch -p1 < ../0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch [$(TOP)]$ patch -p1 < ../0004-Use-radix-tree-for-character-conversion.patch [$(TOP)]$ ./configure [Unicode]$ make '/usr/bin/perl' UCS_to_most.pl Type of arg 1 to keys must be hash (not hash element) at convutils.pm line 443, near "}) " Type of arg 1 to values must be hash (not hash element) at convutils.pm line 596, near "}) " Type of arg 1 to each must be hash (not private variable) at convutils.pm line 755, near "$map) " Compilation failed in require at UCS_to_most.pl line 19. make: *** [iso8859_2_to_utf8.map] Error 255 Regars, -- Ayumi Ishii -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
Hello, I looked on this closer. The attached is the revised version of this patch. At Mon, 05 Dec 2016 19:29:54 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20161205.192954.12189.horiguchi.kyot...@lab.ntt.co.jp> > Apart from the aboves, I have some trivial comments on the new > version. > > > 1. If we decide not to use old-style maps, UtfToLocal no longer > need to take void * as map data. (Patch 0001) > 2. "use Data::Dumper" doesn't seem necessary. (Patch 0002) > 3. A comment contains a superfluous comma. (Patch 0002) (The last >byte of the first line below) > 4. The following code doesn't seem so perl'ish. > 4. download_srctxts.sh is no longer needed. (No patch) 6. Fixed some inconsistent indentation/folding. 7. Fix handling of $verbose. 8. Sort segments using leading bytes. regards, -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/utils/mb/Unicode/Makefile b/src/backend/utils/mb/Unicode/Makefile index 0345a36..f184f65 100644 --- a/src/backend/utils/mb/Unicode/Makefile +++ b/src/backend/utils/mb/Unicode/Makefile @@ -158,9 +158,6 @@ gb-18030-2000.xml windows-949-2000.xml: euc-jis-2004-std.txt sjis-0213-2004-std.txt: $(DOWNLOAD) http://x0213.org/codetable/$(@F) -gb-18030-2000.xml windows-949-2000.xml: - $(DOWNLOAD) https://ssl.icu-project.org/repos/icu/data/trunk/charset/data/xml/$(@F) - GB2312.TXT: $(DOWNLOAD) 'http://trac.greenstone.org/browser/trunk/gsdl/unicode/MAPPINGS/EASTASIA/GB/GB2312.TXT?rev=1842=txt' @@ -176,7 +173,7 @@ KOI8-R.TXT KOI8-U.TXT: $(ISO8859TEXTS): $(DOWNLOAD) http://ftp.unicode.org/Public/MAPPINGS/ISO8859/$(@F) -$(filter-out CP8%,$(WINTEXTS)) CP932.TXT CP950.TXT: +$(filter-out CP8%,$(WINTEXTS)) $(filter CP9%, $(SPECIALTEXTS)): $(DOWNLOAD) http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/$(@F) $(filter CP8%,$(WINTEXTS)): diff --git a/src/backend/utils/mb/Unicode/UCS_to_BIG5.pl b/src/backend/utils/mb/Unicode/UCS_to_BIG5.pl index 822ab28..62e5145 100755 --- a/src/backend/utils/mb/Unicode/UCS_to_BIG5.pl +++ b/src/backend/utils/mb/Unicode/UCS_to_BIG5.pl @@ -51,7 +51,9 @@ foreach my $i (@$cp950txt) { code => $code, ucs => $ucs, comment => $i->{comment}, - direction => "both" }; + direction => "both", + f => $i->{f}, + l => $i->{l} }; } } @@ -70,6 +72,6 @@ foreach my $i (@$all) } # Output -print_tables($this_script, "BIG5", $all); +print_tables($this_script, "BIG5", $all, 1); print_radix_trees($this_script, "BIG5", $all); diff --git a/src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl b/src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl index a933c12..299beec 100755 --- a/src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl +++ b/src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl @@ -72,9 +72,11 @@ while (<$in>) push @mapping, { ucs => $ucs, code => $code, - direction => 'both' }; + direction => 'both', + f => $in_file, + l => $. }; } close($in); -print_tables($this_script, "EUC_CN", \@mapping); +print_tables($this_script, "EUC_CN", \@mapping, 1); print_radix_trees($this_script, "EUC_CN", \@mapping); diff --git a/src/backend/utils/mb/Unicode/UCS_to_EUC_JIS_2004.pl b/src/backend/utils/mb/Unicode/UCS_to_EUC_JIS_2004.pl index 1bf7f2e..fea03df 100755 --- a/src/backend/utils/mb/Unicode/UCS_to_EUC_JIS_2004.pl +++ b/src/backend/utils/mb/Unicode/UCS_to_EUC_JIS_2004.pl @@ -31,12 +31,14 @@ while (my $line = <$in>) my $ucs1 = hex($u1); my $ucs2 = hex($u2); - push @all, { direction => 'both', - ucs => $ucs1, - ucs_second => $ucs2, - code => $code, - comment => $rest }; - next; + push @all, + { direction => 'both', + ucs=> $ucs1, + ucs_second => $ucs2, + code => $code, + comment=> $rest, + f => $in_file, + l => $. }; } elsif ($line =~ /^0x(.*)[ \t]*U\+(.*)[ \t]*#(.*)$/) { @@ -47,7 +49,13 @@ while (my $line = <$in>) next if ($code < 0x80 && $ucs < 0x80); - push @all, { direction => 'both', ucs => $ucs, code => $code, comment => $rest }; + push @all, + { direction => 'both', + ucs => $ucs, + code => $code, + comment => $rest, + f => $in_file, + l => $. }; } } close($in); diff --git a/src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl b/src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl index 5ac3542..9dcb9e2 100755 --- a/src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl +++ b/src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl @@ -108,98 +108,98 @@ foreach my $i (@mapping) } push @mapping, ( - {direction => 'both', ucs => 0x4efc, code => 0x8ff4af, comment => '# CJK(4EFC)'}, - {direction => 'both', ucs => 0x50f4, code => 0x8ff4b0, comment => '# CJK(50F4)'}, - {direction => 'both', ucs => 0x51EC, code => 0x8ff4b1, comment => '# CJK(51EC)'}, - {direction => 'both', ucs => 0x5307, code => 0x8ff4b2, comment => '# CJK(5307)'}, - {direction => 'both', ucs => 0x5324, code => 0x8ff4b3, comment => '#
Re: [HACKERS] Radix tree for character conversion
Hello, thank you for reviewing this. I compared mine and yours. The new patch works fine and gives smaller radix map files. It seems also to me more readable. At Fri, 2 Dec 2016 22:07:07 +0200, Heikki Linnakangaswrote in > On 11/09/2016 10:38 AM, Kyotaro HORIGUCHI wrote: > > Thanks. The attached patch contains the patch by perlcritic. > > > > 0001,2,3 are Heikki's patch that are not modified since it is > > first proposed. It's a bit too big so I don't attach them to this > > mail (again). > > > > https://www.postgresql.org/message-id/08e7892a-d55c-eefe-76e6-7910bc8dd...@iki.fi > > I've now pushed these preliminary patches, with the applicable fixes > from you and Daniel. The attached patch is now against git master. Thanks for committing them. > > 0004 is radix-tree stuff, applies on top of the three patches > > above. > > I've spent the last couple of days reviewing this. While trying to > understand how it works, I ended up dismantling, rewriting, and > putting back together most of the added perl code. I might have been putting too much in one structure and a bit too eager to conceal lower level from the upper level. > Attached is a new > version, with more straightforward logic, making it more > understandable. I find it more understandable, anyway, I hope it's not > only because I wrote it myself :-). Let me know what you think. First, thank you for the refactoring(?). I didn't intend to replace all of .map files with radix files at first. Finally my patch removes all old-style map file but I haven't noticed that. So removing the old bsearch code seems reasonable. Avoiding redundant decomposition of multibyte characters into bytes seems reasonable from the view of efficiency. The new patch decomposes the structured pg_mb_radix_tree into a series of (basically) plain member variables in a struct. I'm not so in favor of the style but a radix tree of at least 4-levels is easily read in the style and maybe the code to handle it is rather easily readable. (So +1 for it) > In particular, I found the explanations of flat and segmented tables > really hard to understand. So in this version, the radix trees for a > conversion are stored completely in one large array. Leaf and > intermediate levels are all in the same array. When reading this > version, please note that I'm not sure if I mean the same thing with > "segment" that you did in your version. > I moved the "lower" and "upper" values in the structs. Also, there are > now also separate "lower" and "upper" values for the leaf levels of > the trees, for 1- 2-, 3- and 4-byte inputs. This made a huge The "segment" there seems to mean definitely the same to mine. Flattening the on-memory structure is fine from the same reason to the above. > difference to the size of gb18030_to_utf8_radix.map, in particular: > the source file shrank from about 2 MB to 1/2 MB. In that conversion, > the valid range for the last byte of 2-byte inputs is 0x40-0xfe, and > the valid range for the last byte of 4-byte inputs is 0x30-0x39. With > the old patch version, the "chars" range was therefore 0x30-0xfe, to > cover both of those, and most of the array was filled with zeros. With > this new patch version, we store separate ranges for those, and can > leave out most of the zeros. Great. I agree that the (logically) devided chartable is significantly space-efficient. > There's a segment full of zeros at the beginning of each conversion > array now. The purpose of that is that when traversing the radix tree, > you don't need to check each intermediate value for 0. If you follow a > 0 offset, it simply points to the dummy all-zeros segments in the > beginning. Seemed like a good idea to shave some cycles, although I'm > not sure if it made much difference in reality. And I like the zero page. > I optimized pg_mb_radix_conv() a bit, too. We could do more. For > example, I think it would save some cycles to have specialized > versions of UtfToLocal and LocalToUtf, moving the tests for whether a > combined character map and/or conversion callback is used, out of the > loop. They feel a bit ugly too, in their current form... Hmm. Maybe decomposing iiso in pg_mb_radix_conv is faster than pushing extra 3 (or 4) parameters into the stack (or it's wrong if they are passed using registers?) but I'm not sure. > I need a break now, but I'll try to pick this up again some time next > week. Meanwhile, please have a look and tell me what you think. Thank you very much for the big effort on this patch. Apart from the aboves, I have some trivial comments on the new version. 1. If we decide not to use old-style maps, UtfToLocal no longer need to take void * as map data. (Patch 0001) 2. "use Data::Dumper" doesn't seem necessary. (Patch 0002) 3. A comment contains a superfluous comma. (Patch 0002) (The last byte of the first line below) > ### The segments are written out physically to one big array
Re: [HACKERS] Radix tree for character conversion
On 12/02/2016 10:18 PM, Alvaro Herrera wrote: Heikki Linnakangas wrote: On 11/09/2016 10:38 AM, Kyotaro HORIGUCHI wrote: Thanks. The attached patch contains the patch by perlcritic. 0001,2,3 are Heikki's patch that are not modified since it is first proposed. It's a bit too big so I don't attach them to this mail (again). https://www.postgresql.org/message-id/08e7892a-d55c-eefe-76e6-7910bc8dd...@iki.fi I've now pushed these preliminary patches, with the applicable fixes from you and Daniel. The attached patch is now against git master. Is this the Nov. 30th commit? Because I don't see any other commits from you. Yes. Sorry for the confusion. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
Heikki Linnakangas wrote: > On 11/09/2016 10:38 AM, Kyotaro HORIGUCHI wrote: > > Thanks. The attached patch contains the patch by perlcritic. > > > > 0001,2,3 are Heikki's patch that are not modified since it is > > first proposed. It's a bit too big so I don't attach them to this > > mail (again). > > > > https://www.postgresql.org/message-id/08e7892a-d55c-eefe-76e6-7910bc8dd...@iki.fi > > I've now pushed these preliminary patches, with the applicable fixes from > you and Daniel. The attached patch is now against git master. Is this the Nov. 30th commit? Because I don't see any other commits from you. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
On 11/09/2016 10:38 AM, Kyotaro HORIGUCHI wrote: Thanks. The attached patch contains the patch by perlcritic. 0001,2,3 are Heikki's patch that are not modified since it is first proposed. It's a bit too big so I don't attach them to this mail (again). https://www.postgresql.org/message-id/08e7892a-d55c-eefe-76e6-7910bc8dd...@iki.fi I've now pushed these preliminary patches, with the applicable fixes from you and Daniel. The attached patch is now against git master. 0004 is radix-tree stuff, applies on top of the three patches above. I've spent the last couple of days reviewing this. While trying to understand how it works, I ended up dismantling, rewriting, and putting back together most of the added perl code. Attached is a new version, with more straightforward logic, making it more understandable. I find it more understandable, anyway, I hope it's not only because I wrote it myself :-). Let me know what you think. In particular, I found the explanations of flat and segmented tables really hard to understand. So in this version, the radix trees for a conversion are stored completely in one large array. Leaf and intermediate levels are all in the same array. When reading this version, please note that I'm not sure if I mean the same thing with "segment" that you did in your version. I moved the "lower" and "upper" values in the structs. Also, there are now also separate "lower" and "upper" values for the leaf levels of the trees, for 1- 2-, 3- and 4-byte inputs. This made a huge difference to the size of gb18030_to_utf8_radix.map, in particular: the source file shrank from about 2 MB to 1/2 MB. In that conversion, the valid range for the last byte of 2-byte inputs is 0x40-0xfe, and the valid range for the last byte of 4-byte inputs is 0x30-0x39. With the old patch version, the "chars" range was therefore 0x30-0xfe, to cover both of those, and most of the array was filled with zeros. With this new patch version, we store separate ranges for those, and can leave out most of the zeros. There's a segment full of zeros at the beginning of each conversion array now. The purpose of that is that when traversing the radix tree, you don't need to check each intermediate value for 0. If you follow a 0 offset, it simply points to the dummy all-zeros segments in the beginning. Seemed like a good idea to shave some cycles, although I'm not sure if it made much difference in reality. I optimized pg_mb_radix_conv() a bit, too. We could do more. For example, I think it would save some cycles to have specialized versions of UtfToLocal and LocalToUtf, moving the tests for whether a combined character map and/or conversion callback is used, out of the loop. They feel a bit ugly too, in their current form... I need a break now, but I'll try to pick this up again some time next week. Meanwhile, please have a look and tell me what you think. - Heikki diff --git a/src/backend/utils/mb/Unicode/Makefile b/src/backend/utils/mb/Unicode/Makefile index ea21f4a..f184f65 100644 --- a/src/backend/utils/mb/Unicode/Makefile +++ b/src/backend/utils/mb/Unicode/Makefile @@ -40,23 +40,30 @@ WINMAPS = win866_to_utf8.map utf8_to_win866.map \ GENERICMAPS = $(ISO8859MAPS) $(WINMAPS) \ gbk_to_utf8.map utf8_to_gbk.map \ - koi8r_to_utf8.map utf8_to_koi8r.map + koi8r_to_utf8.map utf8_to_koi8r.map \ + koi8u_to_utf8.map utf8_to_koi8u.map SPECIALMAPS = euc_cn_to_utf8.map utf8_to_euc_cn.map \ euc_jp_to_utf8.map utf8_to_euc_jp.map \ + euc_jis_2004_to_utf8.map utf8_to_euc_jis_2004.map \ euc_kr_to_utf8.map utf8_to_euc_kr.map \ euc_tw_to_utf8.map utf8_to_euc_tw.map \ sjis_to_utf8.map utf8_to_sjis.map \ + shift_jis_2004_to_utf8.map utf8_to_shift_jis_2004.map \ gb18030_to_utf8.map utf8_to_gb18030.map \ big5_to_utf8.map utf8_to_big5.map \ johab_to_utf8.map utf8_to_johab.map \ uhc_to_utf8.map utf8_to_uhc.map \ - euc_jis_2004_to_utf8.map euc_jis_2004_to_utf8_combined.map \ + +COMBINEDMAPS = euc_jis_2004_to_utf8.map euc_jis_2004_to_utf8_combined.map \ utf8_to_euc_jis_2004.map utf8_to_euc_jis_2004_combined.map \ shift_jis_2004_to_utf8.map shift_jis_2004_to_utf8_combined.map \ utf8_to_shift_jis_2004.map utf8_to_shift_jis_2004_combined.map -MAPS = $(GENERICMAPS) $(SPECIALMAPS) +MAPS = $(GENERICMAPS) $(SPECIALMAPS) $(COMBINEDMAPS) + +RADIXGENERICMAPS = $(subst .map,_radix.map,$(GENERICMAPS)) +RADIXMAPS = $(subst .map,_radix.map,$(GENERICMAPS) $(SPECIALMAPS)) ISO8859TEXTS = 8859-2.TXT 8859-3.TXT 8859-4.TXT 8859-5.TXT \ 8859-6.TXT 8859-7.TXT 8859-8.TXT 8859-9.TXT \ @@ -68,53 +75,76 @@ WINTEXTS = CP866.TXT CP874.TXT CP936.TXT \ CP1252.TXT CP1253.TXT CP1254.TXT CP1255.TXT \ CP1256.TXT CP1257.TXT CP1258.TXT +SPECIALTEXTS = BIG5.TXT CNS11643.TXT \ + CP932.TXT CP950.TXT \ + JIS0201.TXT JIS0208.TXT JIS0212.TXT SHIFTJIS.TXT \ + JOHAB.TXT KSX1001.TXT windows-949-2000.xml \ + euc-jis-2004-std.txt sjis-0213-2004-std.txt \ + gb-18030-2000.xml + GENERICTEXTS = $(ISO8859TEXTS)
Re: [HACKERS] Radix tree for character conversion
On 10/31/2016 06:11 PM, Daniel Gustafsson wrote: On 27 Oct 2016, at 09:23, Kyotaro HORIGUCHIwrote: At Tue, 25 Oct 2016 12:23:48 +0300, Heikki Linnakangas wrote in <08e7892a-d55c-eefe-76e6-7910bc8dd...@iki.fi> [..] The perl scripts are still quite messy. For example, I lost the checks for duplicate mappings somewhere along the way - that ought to be put back. My Perl skills are limited. Perl scripts are to be messy, I believe. Anyway the duplicate check as been built into the sub print_radix_trees. Maybe the same check is needed by some plain map files but it would be just duplication for the maps having radix tree. I took a small stab at doing some cleaning of the Perl scripts, mainly around using the more modern (well, modern as in +15 years old) form for open(..), avoiding global filehandles for passing scalar references and enforcing use strict. Some smaller typos and fixes were also included. It seems my Perl has become a bit rusty so I hope the changes make sense. The produced files are identical with these patches applied, they are merely doing cleaning as opposed to bugfixing. The attached patches are against the 0001-0006 patches from Heikki and you in this series of emails, the separation is intended to make them easier to read. Thanks! Patches 0001-0003 seem to have been mostly unchanged for the later discussion and everyone seems to be happy with those patches, so I picked the parts of these cleanups of yours that applied to my patches 0001-0003, and pushed those. I'll continue reviewing the rest.. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
Hello. I'll be off line until at least next Monday. So I move this to the next CF by myself. At Wed, 09 Nov 2016 17:38:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20161109.173853.77274443.horiguchi.kyot...@lab.ntt.co.jp> > Hello, thank you for polishing this. > > At Wed, 9 Nov 2016 02:19:01 +0100, Daniel Gustafsson wrote > in <80f34f25-bf6d-4bcd-9c38-42ed10d3f...@yesql.se> > > > On 08 Nov 2016, at 17:37, Peter Eisentraut > > > wrote: > > > > > > On 10/31/16 12:11 PM, Daniel Gustafsson wrote: > > >> I took a small stab at doing some cleaning of the Perl scripts, mainly > > >> around > > >> using the more modern (well, modern as in +15 years old) form for > > >> open(..), > > >> avoiding global filehandles for passing scalar references and enforcing > > >> use > > >> strict. Some smaller typos and fixes were also included. It seems my > > >> Perl has > > >> become a bit rusty so I hope the changes make sense. The produced files > > >> are > > >> identical with these patches applied, they are merely doing cleaning as > > >> opposed > > >> to bugfixing. > > >> > > >> The attached patches are against the 0001-0006 patches from Heikki and > > >> you in > > >> this series of emails, the separation is intended to make them easier to > > >> read. > > > > > > Cool. See also here: > > > https://www.postgresql.org/message-id/55E52225.4040305%40gmx.net > > > Nice, not having hacked much Perl in quite a while I had all but forgotten > > about perlcritic. > > I tried it on CentOS7. Installation failed saying that > Module::Build is too old. It is yum-inatlled so removed it and > installed it with CPAN. Again failed with many 'Could not create > MYMETA files'. Then tried to install CPAN::Meta and it failed > saying that CPAN::Meta::YAML is too *new*. That sucks. > > So your patch is greately helpfull. Thank you. > > | -my @mapnames = map { s/\.map//; $_ } values %plainmaps; > | +my @mapnames = map { my $m = $_; $m =~ s/\.map//; $m } values %plainmaps; > > It surprised me to know that perlcritic does such things. > > > Running it on the current version of the patchset yields mostly warnings on > > string values used in the require “convutils.pm” statement. There were > > however > > two more interesting reports: one more open() call not using the three > > parameter form and an instance of map which alters the input value. > > Sorry for overlooking it. > > > The latter > > is not causing an issue since we don’t use the input list past the map but > > fixing it seems like good form. > > Agreed. > > > Attached is a patch that addresses the perlcritic reports (running without > > any > > special options). > > Thanks. The attached patch contains the patch by perlcritic. > > 0001,2,3 are Heikki's patch that are not modified since it is > first proposed. It's a bit too big so I don't attach them to this > mail (again). > > https://www.postgresql.org/message-id/08e7892a-d55c-eefe-76e6-7910bc8dd...@iki.fi > > 0004 is radix-tree stuff, applies on top of the three patches > above. > > There's a hidden fifth patch which of 20MB in size. But it is > generated by running make in the Unicode directory. > > [$(TOP)]$ ./configure ... > [$(TOP)]$ make > [Unicode]$ make > [Unicode]$ make distclean > [Unicode]$ git add . > [Unicode]$ commit > === COMMITE MESSSAGE > Replace map files with radix tree files. > > These encodings no longer uses the former map files and uses new radix > tree files. All existing authority files in this directory are removed. > === -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
Hello, thank you for polishing this. At Wed, 9 Nov 2016 02:19:01 +0100, Daniel Gustafssonwrote in <80f34f25-bf6d-4bcd-9c38-42ed10d3f...@yesql.se> > > On 08 Nov 2016, at 17:37, Peter Eisentraut > > wrote: > > > > On 10/31/16 12:11 PM, Daniel Gustafsson wrote: > >> I took a small stab at doing some cleaning of the Perl scripts, mainly > >> around > >> using the more modern (well, modern as in +15 years old) form for open(..), > >> avoiding global filehandles for passing scalar references and enforcing use > >> strict. Some smaller typos and fixes were also included. It seems my > >> Perl has > >> become a bit rusty so I hope the changes make sense. The produced files > >> are > >> identical with these patches applied, they are merely doing cleaning as > >> opposed > >> to bugfixing. > >> > >> The attached patches are against the 0001-0006 patches from Heikki and you > >> in > >> this series of emails, the separation is intended to make them easier to > >> read. > > > > Cool. See also here: > > https://www.postgresql.org/message-id/55E52225.4040305%40gmx.net > Nice, not having hacked much Perl in quite a while I had all but forgotten > about perlcritic. I tried it on CentOS7. Installation failed saying that Module::Build is too old. It is yum-inatlled so removed it and installed it with CPAN. Again failed with many 'Could not create MYMETA files'. Then tried to install CPAN::Meta and it failed saying that CPAN::Meta::YAML is too *new*. That sucks. So your patch is greately helpfull. Thank you. | -my @mapnames = map { s/\.map//; $_ } values %plainmaps; | +my @mapnames = map { my $m = $_; $m =~ s/\.map//; $m } values %plainmaps; It surprised me to know that perlcritic does such things. > Running it on the current version of the patchset yields mostly warnings on > string values used in the require “convutils.pm” statement. There were > however > two more interesting reports: one more open() call not using the three > parameter form and an instance of map which alters the input value. Sorry for overlooking it. > The latter > is not causing an issue since we don’t use the input list past the map but > fixing it seems like good form. Agreed. > Attached is a patch that addresses the perlcritic reports (running without any > special options). Thanks. The attached patch contains the patch by perlcritic. 0001,2,3 are Heikki's patch that are not modified since it is first proposed. It's a bit too big so I don't attach them to this mail (again). https://www.postgresql.org/message-id/08e7892a-d55c-eefe-76e6-7910bc8dd...@iki.fi 0004 is radix-tree stuff, applies on top of the three patches above. There's a hidden fifth patch which of 20MB in size. But it is generated by running make in the Unicode directory. [$(TOP)]$ ./configure ... [$(TOP)]$ make [Unicode]$ make [Unicode]$ make distclean [Unicode]$ git add . [Unicode]$ commit === COMMITE MESSSAGE Replace map files with radix tree files. These encodings no longer uses the former map files and uses new radix tree files. All existing authority files in this directory are removed. === regards, >From e6718001cbe3f1937e4f052c66bff68f7217c43c Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 27 Oct 2016 14:19:47 +0900 Subject: [PATCH 4/5] Make map generators to generate radix tree files. This introduces radix tree conversion to map-referring(non-arithmetic) code conversions. UCS_to*.pl files generate _radix.map files addition to .map files currently generated. These _radix.map files are referenced by conversion procs instead of .map files. Radix trees are not easily verified, so a checker is provided. "make mapcheck" builds and runs it. It verifies radix maps against corresponding .map files. Now 'make distclean' removes authority files and unused .map files since they should not be contained in source archive. On the other hand 'make maintainer-clean' leaves them and removes all map files. This seems somewhat strange but it comes from the special characteristics of this directory. All perl scripts turned into modern style. Use strict, modern usage of file handles and accept reindentation by pgperltidy. Perl scripts in this commit have been applied it. For the first time running "make all", some UCS_to*.pl scripts many times because of a limitation of gmake's capability. --- src/backend/utils/mb/Unicode/Makefile | 78 +- src/backend/utils/mb/Unicode/UCS_to_BIG5.pl| 33 +- src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl | 36 +- .../utils/mb/Unicode/UCS_to_EUC_JIS_2004.pl| 69 +- src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl | 42 +- src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl | 12 +- src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl | 23 +- src/backend/utils/mb/Unicode/UCS_to_GB18030.pl | 32 +- src/backend/utils/mb/Unicode/UCS_to_JOHAB.pl | 10 +-
Re: [HACKERS] Radix tree for character conversion
> On 08 Nov 2016, at 17:37, Peter Eisentraut> wrote: > > On 10/31/16 12:11 PM, Daniel Gustafsson wrote: >> I took a small stab at doing some cleaning of the Perl scripts, mainly around >> using the more modern (well, modern as in +15 years old) form for open(..), >> avoiding global filehandles for passing scalar references and enforcing use >> strict. Some smaller typos and fixes were also included. It seems my Perl >> has >> become a bit rusty so I hope the changes make sense. The produced files are >> identical with these patches applied, they are merely doing cleaning as >> opposed >> to bugfixing. >> >> The attached patches are against the 0001-0006 patches from Heikki and you in >> this series of emails, the separation is intended to make them easier to >> read. > > Cool. See also here: > https://www.postgresql.org/message-id/55E52225.4040305%40gmx.net Nice, not having hacked much Perl in quite a while I had all but forgotten about perlcritic. Running it on the current version of the patchset yields mostly warnings on string values used in the require “convutils.pm” statement. There were however two more interesting reports: one more open() call not using the three parameter form and an instance of map which alters the input value. The latter is not causing an issue since we don’t use the input list past the map but fixing it seems like good form. Attached is a patch that addresses the perlcritic reports (running without any special options). cheers ./daniel fix_perlcritic.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
On 10/31/16 12:11 PM, Daniel Gustafsson wrote: > I took a small stab at doing some cleaning of the Perl scripts, mainly around > using the more modern (well, modern as in +15 years old) form for open(..), > avoiding global filehandles for passing scalar references and enforcing use > strict. Some smaller typos and fixes were also included. It seems my Perl > has > become a bit rusty so I hope the changes make sense. The produced files are > identical with these patches applied, they are merely doing cleaning as > opposed > to bugfixing. > > The attached patches are against the 0001-0006 patches from Heikki and you in > this series of emails, the separation is intended to make them easier to read. Cool. See also here: https://www.postgresql.org/message-id/55E52225.4040305%40gmx.net -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
> On 08 Nov 2016, at 12:21, Kyotaro HORIGUCHI> wrote: > > Hello, this is the revising patch applies on top of the previous > patch. > > ... > > Finally the attached patch contains most of (virtually all of) > Daniel's suggestion and some modification by pgperltidy. Reading over this it looks good to me. I did spot one thing I had missed before though, the error message below should be referencing the scalar variable ‘direction' unless I’m missing something: - die "unacceptable direction : %direction" + die "unacceptable direction : $direction" if ($direction ne "to_unicode" && $direction ne "from_unicode"); With this, I would consider this ready for committer. >> Addition to this, I'll remove existing authority files and modify >> radix generator so that it can read plain map files in the next >> patch. > > So, I think the attached are in rather modern shape. +1, nice work! cheers ./daniel -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
Hello, this is the revising patch applies on top of the previous patch. Differences on map files are enormous but useless for discussion so they aren't included in this. (but can be generated) This still doesn't remove three .txt/.xml files since it heavily bloats the patch. I'm planning that they are removed in the final shape. All authority files including the removed files are automatically downloaded by the Makefile in this patch. At Tue, 08 Nov 2016 10:43:56 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20161108.104356.265607041.horiguchi.kyot...@lab.ntt.co.jp> > https://www.postgresql.org/docs/devel/static/install-requirements.html > > | Perl 5.8 or later is needed to build from a Git checkout, or if > | you changed the input files for any of the build steps that use > | Perl scripts. If building on Windows you will need Perl in any > | case. Perl is also required to run some test suites. > > So, we should assume Perl 5.8 (released in 2002!) on build > time. And actually 5.10 on RedHat 6.4, 5.16 on my > environment(ContOS 7.2), and the official doc is at 5.24. Active > perl is 5.24. According to this, we should use syntax supported > as of 5.8 and/but not obsolete until 5.24, then to follow the > latest convention. But not OO. (But I can't squeeze out a > concrete syntax set out of this requirements :( ) ...(forget this for a while..) Finally the attached patch contains most of (virtually all of) Daniel's suggestion and some modification by pgperltidy. > Addition to this, I'll remove existing authority files and modify > radix generator so that it can read plain map files in the next > patch. So, I think the attached are in rather modern shape. At Tue, 08 Nov 2016 11:02:58 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20161108.110258.59832499.horiguchi.kyot...@lab.ntt.co.jp> > Hmm. Somehow perl-mode on my Emacs is stirring with > ununderstandable indentation and I manually correct them so it is > highly probable that the style of this patch is not compatible > with the defined style. Anyway it is better that pgindent > generates smaller patch so I'll try it. The attached are applied pgperltidy. Several regions such like additional character list are marked not to be edited. One concern is what to leave by 'make distclen' and 'make maintainer-clean'. The former should remove authority *.TXT files since it shouldn't be in source archive. On the other hand it is more convenient that the latter leaves them. This seems somewhat strange but I can't come up with better behavior for now. regards, -- Kyotaro Horiguchi NTT Open Source Software Center >From 02a19deb3eb74069936ced0dbea4693322ad9ec8 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Tue, 8 Nov 2016 18:22:36 +0900 Subject: [PATCH 1/2] Edit perl scripts into modern style Part of existing and added scripts are in obsolete style. This path edits them into modern style as the suggestion by Daniel Gustafsson. --- src/backend/utils/mb/Unicode/UCS_to_BIG5.pl| 31 +- src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl | 33 +- .../utils/mb/Unicode/UCS_to_EUC_JIS_2004.pl| 66 ++- src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl | 36 +- src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl | 9 +- src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl | 20 +- src/backend/utils/mb/Unicode/UCS_to_GB18030.pl | 29 +- src/backend/utils/mb/Unicode/UCS_to_JOHAB.pl | 7 +- .../utils/mb/Unicode/UCS_to_SHIFT_JIS_2004.pl | 92 ++-- src/backend/utils/mb/Unicode/UCS_to_SJIS.pl| 27 +- src/backend/utils/mb/Unicode/UCS_to_UHC.pl | 35 +- src/backend/utils/mb/Unicode/UCS_to_most.pl| 14 +- src/backend/utils/mb/Unicode/convutils.pm | 579 +++-- src/backend/utils/mb/Unicode/make_mapchecker.pl| 18 +- 14 files changed, 529 insertions(+), 467 deletions(-) diff --git a/src/backend/utils/mb/Unicode/UCS_to_BIG5.pl b/src/backend/utils/mb/Unicode/UCS_to_BIG5.pl index 7b14817..0412723 100755 --- a/src/backend/utils/mb/Unicode/UCS_to_BIG5.pl +++ b/src/backend/utils/mb/Unicode/UCS_to_BIG5.pl @@ -24,10 +24,10 @@ # UCS-2 code in hex # # and Unicode name (not used in this script) - +use strict; require "convutils.pm"; -$this_script = $0; +my $this_script = $0; # Load BIG5.TXT my $all = _source("BIG5.TXT"); @@ -35,9 +35,10 @@ my $all = _source("BIG5.TXT"); # Load CP950.TXT my $cp950txt = _source("CP950.TXT"); -foreach my $i (@$cp950txt) { +foreach my $i (@$cp950txt) +{ my $code = $i->{code}; - my $ucs = $i->{ucs}; + my $ucs = $i->{ucs}; # Pick only the ETEN extended characters in the range 0xf9d6 - 0xf9dc # from CP950.TXT @@ -46,20 +47,22 @@ foreach my $i (@$cp950txt) { && $code >= 0xf9d6 && $code <= 0xf9dc) { - push @$all, {code => $code, - ucs => $ucs, - comment => $i->{comment}, - direction
Re: [HACKERS] Radix tree for character conversion
Hello, At Mon, 7 Nov 2016 17:19:29 +0100, Daniel Gustafssonwrote in <39e295b9-7391-40b6-911d-fe852e460...@yesql.se> > > On 07 Nov 2016, at 12:32, Daniel Gustafsson wrote: > > > >> On 04 Nov 2016, at 08:34, Kyotaro HORIGUCHI > >> wrote: > >> > >> I'm not sure how the discussion about this goes, these patches > >> makes me think about coding style of Perl. > > > > Some of this can absolutely be considered style and more or less down to > > personal preference. I haven’t seen any coding conventions for Perl so I > > assume it’s down to consensus among the committers. > > Actually, scratch that; there is of course a perltidy profile in the pgindent > directory. I should avoid sending email before coffee.. Hmm. Somehow perl-mode on my Emacs is stirring with ununderstandable indentation and I manually correct them so it is highly probable that the style of this patch is not compatible with the defined style. Anyway it is better that pgindent generates smaller patch so I'll try it. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
Hello, At Mon, 7 Nov 2016 12:32:55 +0100, Daniel Gustafssonwrote in > > On 04 Nov 2016, at 08:34, Kyotaro HORIGUCHI > > wrote: > > I'm not sure how the discussion about this goes, these patches > > makes me think about coding style of Perl. > > Some of this can absolutely be considered style and more or less down to > personal preference. I haven’t seen any coding conventions for Perl so I > assume it’s down to consensus among the committers. My rationale for these > patches in the first place was that I perceived this thread to partly want to > clean up the code and make it more modern Perl. > > > The distinction between executable script and library is by > > intention with an obscure basis. Existing scripts don't get less > > modification, but library uses more restricted scopes to get rid > > of the troubles caused by using global scopes. But I don't have a > > clear preference on that. The TAP test scripts takes OO notations > > but I'm not sure convutils.pl also be better to take the same > > notation. It would be rarely edited hereafter and won't gets > > grown any more. > > I think the current convutils module is fine and converting it to OO would be > overkill. Agreed. > > As far as I see the obvious bug fixes in the patchset are the > > following, > > Agreed, with some comments: > > > - 0007: load_maptable fogets to close input file. > > An interesting note on this is that it’s not even a bug =) Since $in is a > scalar reference, there is no need to explicitly close() the filehandle since > the reference counter will close it on leaving scope, but there’s no harm in > doing it ourselves and it also makes for less confusion for anyone not > familiar > with Perl internals. Wow. I didn't know that perl has such a hidden-OO feature. Nevertheless, implicit close is not friendly to who are not familiar with newer perl. Your comment led me to confirm the requirement to build PostgreSQL. https://www.postgresql.org/docs/devel/static/install-requirements.html | Perl 5.8 or later is needed to build from a Git checkout, or if | you changed the input files for any of the build steps that use | Perl scripts. If building on Windows you will need Perl in any | case. Perl is also required to run some test suites. So, we should assume Perl 5.8 (released in 2002!) on build time. And actually 5.10 on RedHat 6.4, 5.16 on my environment(ContOS 7.2), and the official doc is at 5.24. Active perl is 5.24. According to this, we should use syntax supported as of 5.8 and/but not obsolete until 5.24, then to follow the latest convention. But not OO. (But I can't squeeze out a concrete syntax set out of this requirements :( ) > > - 0010: commment for load_maptables is wrong. > > There is also a fix for a typo in make_mapchecker.pl > > > - 0011: hash reference is incorrectly dereferenced > > > > All other fixes other than the above three seem to be styling or > > syntax-generation issues and I don't know whether any > > recommendation exists… > > I think there are some more fixes that arent styling/syntax remaining. I’ll > go > through the patches one by one: > > 0007 - While this might be considered styling/syntax, my $0.02 is that it’s > not > and instead a worthwhile change. I’ll illustrate with an example from the > patch in question: > > Using a bareword global variable in open() for the filehandle was replaced > with > the three-part form in 5.6 and is now even actively discouraged from in the > Perl documentation (and has been so since the 5.20 docs). The problem is that > they are global and can thus easily clash, so easily that the 0007 patch > actually fixes one such occurrence: That's what should be adopted in the criteria above. - Don't use bareword globals. - Use open() with separate MODE argument. > print_radix_map() opens the file in the global filehandle OUT and passes it to > print_radix_table() with the typeglob *OUT; print_radit_table() in turn passes > the filehandle to print_segmented_table() which writes to the file using the > parameter $hd, except in one case where it uses the global OUT variable > without > knowing it will be the right file. This is where the hunk below in 0007 comes > in: > > - print OUT "$line\n"; > + print { $$hd } "$line\n"; > > In this case OUT references the right file and it produces the right result, > but it illustrates how easy it is to get wrong (which can cause very subtle > bugs). So, when poking at this code we might as well, IMHO, use what is today > in Perl considered the right way to deal with filehandle references. Thanks for the detail. Ok, I'll change the style so in the next patch. > Using implicit filemodes can also introduce bugs when opening filenames passed > in from the outside as we do in UCS_to_most.pl. Considering the use case of > these scripts it’s obviously quite low on the list
Re: [HACKERS] Radix tree for character conversion
> On 07 Nov 2016, at 12:32, Daniel Gustafssonwrote: > >> On 04 Nov 2016, at 08:34, Kyotaro HORIGUCHI >> wrote: >> >> I'm not sure how the discussion about this goes, these patches >> makes me think about coding style of Perl. > > Some of this can absolutely be considered style and more or less down to > personal preference. I haven’t seen any coding conventions for Perl so I > assume it’s down to consensus among the committers. Actually, scratch that; there is of course a perltidy profile in the pgindent directory. I should avoid sending email before coffee.. cheers ./daniel -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
> On 04 Nov 2016, at 08:34, Kyotaro HORIGUCHI> wrote: > > Thank you for looling this. And thank you for taking the time to read my patches! > At Mon, 31 Oct 2016 17:11:17 +0100, Daniel Gustafsson wrote > in <3fc648b5-2b7f-4585-9615-207a44b73...@yesql.se> >>> On 27 Oct 2016, at 09:23, Kyotaro HORIGUCHI >>> wrote: >>> Perl scripts are to be messy, I believe. Anyway the duplicate >>> check as been built into the sub print_radix_trees. Maybe the >>> same check is needed by some plain map files but it would be just >>> duplication for the maps having radix tree. >> >> I took a small stab at doing some cleaning of the Perl scripts, mainly around >> using the more modern (well, modern as in +15 years old) form for open(..), >> avoiding global filehandles for passing scalar references and enforcing use >> strict. Some smaller typos and fixes were also included. It seems my Perl >> has >> become a bit rusty so I hope the changes make sense. The produced files are >> identical with these patches applied, they are merely doing cleaning as >> opposed >> to bugfixing. >> >> The attached patches are against the 0001-0006 patches from Heikki and you in >> this series of emails, the separation is intended to make them easier to >> read. > > I'm not sure how the discussion about this goes, these patches > makes me think about coding style of Perl. Some of this can absolutely be considered style and more or less down to personal preference. I haven’t seen any coding conventions for Perl so I assume it’s down to consensus among the committers. My rationale for these patches in the first place was that I perceived this thread to partly want to clean up the code and make it more modern Perl. > The distinction between executable script and library is by > intention with an obscure basis. Existing scripts don't get less > modification, but library uses more restricted scopes to get rid > of the troubles caused by using global scopes. But I don't have a > clear preference on that. The TAP test scripts takes OO notations > but I'm not sure convutils.pl also be better to take the same > notation. It would be rarely edited hereafter and won't gets > grown any more. I think the current convutils module is fine and converting it to OO would be overkill. > As far as I see the obvious bug fixes in the patchset are the > following, Agreed, with some comments: > - 0007: load_maptable fogets to close input file. An interesting note on this is that it’s not even a bug =) Since $in is a scalar reference, there is no need to explicitly close() the filehandle since the reference counter will close it on leaving scope, but there’s no harm in doing it ourselves and it also makes for less confusion for anyone not familiar with Perl internals. > - 0010: commment for load_maptables is wrong. There is also a fix for a typo in make_mapchecker.pl > - 0011: hash reference is incorrectly dereferenced > > All other fixes other than the above three seem to be styling or > syntax-generation issues and I don't know whether any > recommendation exists… I think there are some more fixes that arent styling/syntax remaining. I’ll go through the patches one by one: 0007 - While this might be considered styling/syntax, my $0.02 is that it’s not and instead a worthwhile change. I’ll illustrate with an example from the patch in question: Using a bareword global variable in open() for the filehandle was replaced with the three-part form in 5.6 and is now even actively discouraged from in the Perl documentation (and has been so since the 5.20 docs). The problem is that they are global and can thus easily clash, so easily that the 0007 patch actually fixes one such occurrence: print_radix_map() opens the file in the global filehandle OUT and passes it to print_radix_table() with the typeglob *OUT; print_radit_table() in turn passes the filehandle to print_segmented_table() which writes to the file using the parameter $hd, except in one case where it uses the global OUT variable without knowing it will be the right file. This is where the hunk below in 0007 comes in: - print OUT "$line\n"; + print { $$hd } "$line\n"; In this case OUT references the right file and it produces the right result, but it illustrates how easy it is to get wrong (which can cause very subtle bugs). So, when poking at this code we might as well, IMHO, use what is today in Perl considered the right way to deal with filehandle references. Using implicit filemodes can also introduce bugs when opening filenames passed in from the outside as we do in UCS_to_most.pl. Considering the use case of these scripts it’s obviously quite low on the list of risks but still. 0008 - I don’t think there are any recommendations whether or not to use use strict; in the codebase, there certainly are lots of scripts not doing it. Personally I think it’s good
Re: [HACKERS] Radix tree for character conversion
Thank you for looling this. At Mon, 31 Oct 2016 17:11:17 +0100, Daniel Gustafssonwrote in <3fc648b5-2b7f-4585-9615-207a44b73...@yesql.se> > > On 27 Oct 2016, at 09:23, Kyotaro HORIGUCHI > > wrote: > > Perl scripts are to be messy, I believe. Anyway the duplicate > > check as been built into the sub print_radix_trees. Maybe the > > same check is needed by some plain map files but it would be just > > duplication for the maps having radix tree. > > I took a small stab at doing some cleaning of the Perl scripts, mainly around > using the more modern (well, modern as in +15 years old) form for open(..), > avoiding global filehandles for passing scalar references and enforcing use > strict. Some smaller typos and fixes were also included. It seems my Perl > has > become a bit rusty so I hope the changes make sense. The produced files are > identical with these patches applied, they are merely doing cleaning as > opposed > to bugfixing. > > The attached patches are against the 0001-0006 patches from Heikki and you in > this series of emails, the separation is intended to make them easier to read. I'm not sure how the discussion about this goes, these patches makes me think about coding style of Perl. The distinction between executable script and library is by intention with an obscure basis. Existing scripts don't get less modification, but library uses more restricted scopes to get rid of the troubles caused by using global scopes. But I don't have a clear preference on that. The TAP test scripts takes OO notations but I'm not sure convutils.pl also be better to take the same notation. It would be rarely edited hereafter and won't gets grown any more. As far as I see the obvious bug fixes in the patchset are the following, - 0007: load_maptable fogets to close input file. - 0010: commment for load_maptables is wrong. - 0011: hash reference is incorrectly dereferenced All other fixes other than the above three seem to be styling or syntax-generation issues and I don't know whether any recommendation exists... regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
> On 27 Oct 2016, at 09:23, Kyotaro HORIGUCHI> wrote: > > Hello, thank you very much for the work. My work became quite > easier with it. > > At Tue, 25 Oct 2016 12:23:48 +0300, Heikki Linnakangas > wrote in <08e7892a-d55c-eefe-76e6-7910bc8dd...@iki.fi> >> >> [..] >> The perl scripts are still quite messy. For example, I lost the checks >> for duplicate mappings somewhere along the way - that ought to be put >> back. My Perl skills are limited. > > Perl scripts are to be messy, I believe. Anyway the duplicate > check as been built into the sub print_radix_trees. Maybe the > same check is needed by some plain map files but it would be just > duplication for the maps having radix tree. I took a small stab at doing some cleaning of the Perl scripts, mainly around using the more modern (well, modern as in +15 years old) form for open(..), avoiding global filehandles for passing scalar references and enforcing use strict. Some smaller typos and fixes were also included. It seems my Perl has become a bit rusty so I hope the changes make sense. The produced files are identical with these patches applied, they are merely doing cleaning as opposed to bugfixing. The attached patches are against the 0001-0006 patches from Heikki and you in this series of emails, the separation is intended to make them easier to read. cheers ./daniel 0007-Fix-filehandle-usage.patch Description: Binary data 0008-Make-all-scripts-use-strict-and-rearrange-logic.patch Description: Binary data 0009-Use-my-instead-of-local.patch Description: Binary data 0010-Various-small-style-nits-and-typos.patch Description: Binary data 0011-Fix-hash-lookup.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
Hello, At Fri, 28 Oct 2016 09:42:25 -0400, Tom Lanewrote in <13049.1477662...@sss.pgh.pa.us> > Robert Haas writes: > > On Thu, Oct 27, 2016 at 3:23 AM, Kyotaro HORIGUCHI > > wrote: > >> Perhaps we can put the files into our repositoy by providing some > >> notifications. > > > Uggh, I don't much like advertising clauses. > > Even if the license were exactly compatible with ours, I'd be -1 on > bloating our tarballs with these files. They're large and only a > tiny fraction of developers, let alone end users, will ever care > to look at them. I understood that the intention of Heikki's suggestion, that is, these might be included in PostgreSQL's repository, is looking for a kind of stability, or consistency. The source files are not revision-mangaged. In case where the authorities get unwanted changes or no longer avaiable, .map files have to be edited irelevantly from the authority files maybe from the reason that . Actually some map files have lost their authority file or other map files have got several direct modifications. We will be free from such disturbance by containing "frozen" authority files. On the other hand, I also agree that the advertising or additional bloats of source repositiry are a nuisance. > I think it's fine as long as we have a README file that explains > where to get them. (I'm not even very thrilled with the proposed > auto-download script, as it makes undesirable assumptions about > which internet tools you use, not to mention that it won't work > at all on Windows.) Mmm. It would be a pain in the neck. Some of the files are already stored in "OBSOLETE" directory in the Unicode consortium ftp site, and one of them has been vanished and available from another place, a part of ICU source tree. On the other hand map files are assumed to be generated from the scripts and are to discuraged to be directly edited. Radix map files are uneditable and currently made from the authority files. If some authority files are gone, the additional edit have to be done directly onto map files, and they are in turn to be the authority for radix files. (it's quite easy to chage the authority to current map files, though). By the way, the following phrase of the terms of license. http://www.unicode.org/copyright.html | COPYRIGHT AND PERMISSION NOTICE | | Copyright (c) 1991-2016 Unicode, Inc. All rights reserved. | Distributed under the Terms of Use in http://www.unicode.org/copyright.html. | | Permission is hereby granted, free of charge, to any person obtaining | a copy of the Unicode data files and any associated documentation | (the "Data Files") or Unicode software and any associated documentation | (the "Software") to deal in the Data Files or Software | without restriction, including without limitation the rights to use, | copy, modify, merge, publish, distribute, and/or sell copies of | the Data Files or Software, and to permit persons to whom the Data Files | or Software are furnished to do so, provided that either | (a) this copyright and permission notice appear with all copies | of the Data Files or Software, or | (b) this copyright and permission notice appear in associated | Documentation. I'm afraid that the map (and _radix.map files) are the translates of the "Data Files", and 'translate' is a part of 'modify'. Either the notice is necessary or not, if we decide to wipe the 'true' authority out from our source files, I'd like to make the map files (preferably with comments) as the second authority, _radix.map files are to be getenerated from them, since they're not editable. > I'd actually vote for getting rid of the reference files we > have in the tree now (src/backend/utils/mb/Unicode/*txt), on > the same grounds. That's 600K of stuff that does not need to > be in our tarballs. Anyway, I'd like to register this as an item of this CF. regares, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
On Fri, Oct 28, 2016 at 09:18:08AM -0400, Robert Haas wrote: > On Thu, Oct 27, 2016 at 3:23 AM, Kyotaro HORIGUCHI >wrote: > > | COPYRIGHT AND PERMISSION NOTICE > > | > > | Copyright (c) 1991-2016 Unicode, Inc. All rights reserved. > > | Distributed under the Terms of Use in > > http://www.unicode.org/copyright.html. > > | > > | Permission is hereby granted, free of charge, to any person obtaining > > | a copy of the Unicode data files and any associated documentation > > | (the "Data Files") or Unicode software and any associated documentation > > | (the "Software") to deal in the Data Files or Software > > | without restriction, including without limitation the rights to use, > > | copy, modify, merge, publish, distribute, and/or sell copies of > > | the Data Files or Software, and to permit persons to whom the Data Files > > | or Software are furnished to do so, provided that either > > | (a) this copyright and permission notice appear with all copies > > | of the Data Files or Software, or > > | (b) this copyright and permission notice appear in associated > > | Documentation. > > > > Perhaps we can put the files into our repositoy by providing some > > notifications. > > Uggh, I don't much like advertising clauses. Your dislike is pretty common. Might it be worth reaching out to the Unicode consortium about this? They may well have added that as boilerplate without really considering the effects, and they even have a popup that specifically addresses licensing. http://www.unicode.org/reporting.html Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
Robert Haaswrites: > On Thu, Oct 27, 2016 at 3:23 AM, Kyotaro HORIGUCHI > wrote: >> Perhaps we can put the files into our repositoy by providing some >> notifications. > Uggh, I don't much like advertising clauses. Even if the license were exactly compatible with ours, I'd be -1 on bloating our tarballs with these files. They're large and only a tiny fraction of developers, let alone end users, will ever care to look at them. I think it's fine as long as we have a README file that explains where to get them. (I'm not even very thrilled with the proposed auto-download script, as it makes undesirable assumptions about which internet tools you use, not to mention that it won't work at all on Windows.) I'd actually vote for getting rid of the reference files we have in the tree now (src/backend/utils/mb/Unicode/*txt), on the same grounds. That's 600K of stuff that does not need to be in our tarballs. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
On Thu, Oct 27, 2016 at 3:23 AM, Kyotaro HORIGUCHIwrote: > | COPYRIGHT AND PERMISSION NOTICE > | > | Copyright (c) 1991-2016 Unicode, Inc. All rights reserved. > | Distributed under the Terms of Use in http://www.unicode.org/copyright.html. > | > | Permission is hereby granted, free of charge, to any person obtaining > | a copy of the Unicode data files and any associated documentation > | (the "Data Files") or Unicode software and any associated documentation > | (the "Software") to deal in the Data Files or Software > | without restriction, including without limitation the rights to use, > | copy, modify, merge, publish, distribute, and/or sell copies of > | the Data Files or Software, and to permit persons to whom the Data Files > | or Software are furnished to do so, provided that either > | (a) this copyright and permission notice appear with all copies > | of the Data Files or Software, or > | (b) this copyright and permission notice appear in associated > | Documentation. > > Perhaps we can put the files into our repositoy by providing some > notifications. Uggh, I don't much like advertising clauses. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
On 10/07/2016 06:55 PM, Robert Haas wrote: On Fri, Oct 7, 2016 at 6:46 AM, Heikki Linnakangaswrote: Ouch. We should find and document an authoritative source for all the mappings we have... I think the next steps here are: 1. Find an authoritative source for all the existing mappings. 2. Generate the radix tree files directly from the authoritative sources, instead of the existing *.map files. 3. Completely replace the existing binary-search code with this. It might be best to convert using the existing map files, and then update the mappings later. Otherwise, when things break, you won't know what to blame. I was thinking that we keep the mappings unchanged, but figure out where we got the mappings we have. An authoritative source may well be "file X from unicode, with the following tweaks: ...". As long as we have some way of representing that, in text files, or in perl code, that's OK. What I don't want is that the current *.map files are turned into the authoritative source files, that we modify by hand. There are no comments in them, for starters, which makes hand-editing cumbersome. It seems that we have edited some of them by hand already, but we should rectify that. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
Robert Haaswrites: > On Fri, Oct 7, 2016 at 6:46 AM, Heikki Linnakangas wrote: >> Ouch. We should find and document an authoritative source for all the >> mappings we have... >> >> I think the next steps here are: >> >> 1. Find an authoritative source for all the existing mappings. >> 2. Generate the radix tree files directly from the authoritative sources, >> instead of the existing *.map files. >> 3. Completely replace the existing binary-search code with this. > It might be best to convert using the existing map files, and then > update the mappings later. Otherwise, when things break, you won't > know what to blame. I think I went through this exercise last year or so, and updated the notes about the authoritative sources where I was able to find one. In the remaining cases, I believe that the maps have been intentionally tweaked and we should be cautious about undoing that. Tatsuo-san might remember more about why they are the way they are. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
On Fri, Oct 7, 2016 at 6:46 AM, Heikki Linnakangaswrote: > Ouch. We should find and document an authoritative source for all the > mappings we have... > > I think the next steps here are: > > 1. Find an authoritative source for all the existing mappings. > 2. Generate the radix tree files directly from the authoritative sources, > instead of the existing *.map files. > 3. Completely replace the existing binary-search code with this. It might be best to convert using the existing map files, and then update the mappings later. Otherwise, when things break, you won't know what to blame. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
On 10/07/2016 11:36 AM, Kyotaro HORIGUCHI wrote: The radix conversion function and map conversion script became more generic than the previous state. So I could easily added radix conversion of EUC_JP in addition to SjiftJIS. nm -S said that the size of radix tree data for sjis->utf8 conversion is 34kB and that for utf8->sjis is 46kB. (eucjp->utf8 57kB, utf8->eucjp 93kB) LUmapSJIS and ULmapSJIS was 62kB and 59kB, and LUmapEUC_JP and ULmapEUC_JP was 106kB and 105kB. If I'm not missing something, radix tree is faster and require less memory. Cool! Currently the tree structure is devided into several elements, One for 2-byte, other ones for 3-byte and 4-byte codes and output table. The other than the last one is logically and technically merged into single table but it makes the generator script far complex than the current complexity. I no longer want to play hide'n seek with complex perl object.. I think that's OK. There isn't really anything to gain by merging them. It might be better that combining this as a native feature of the core. Currently the helper function is in core but that function is given as conv_func on calling LocalToUtf. Yeah, I think we want to completely replace the current binary-search based code with this. I would rather maintain just one mechanism. Current implement uses *.map files of pg_utf_to_local as input. It seems not good but the radix tree files is completely uneditable. Provide custom made loading functions for every source instead of load_chartable() would be the way to go. # However, for example utf8_to_sjis.map, it doesn't seem to have # generated from the source mentioned in UCS_to_SJIS.pl Ouch. We should find and document an authoritative source for all the mappings we have... I think the next steps here are: 1. Find an authoritative source for all the existing mappings. 2. Generate the radix tree files directly from the authoritative sources, instead of the existing *.map files. 3. Completely replace the existing binary-search code with this. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers