Re: Using a single standalone-backend run in initdb (was Re: [HACKERS] Bootstrap DATA is a pita)

2015-12-17 Thread Tom Lane
Mark Dilger  writes:
> I use CREATE RULE within startup files in the fork that I maintain.  I have
> lots of them, totaling perhaps 50k lines of rule code.  I don't think any of
> that code would have a problem with the double-newline separation you propose,
> which seems a more elegant solution to me.  Admittedly, the double-newline
> separation would need to be documented at the top of each sql file, otherwise
> it would be quite surprising to those unfamiliar with it.

> You mentioned upthread that introducing a syntax error in one of these files
> results in a not-so-helpful error message that dumps the contents of the
> entire file.  I can confirm that this happens, and it is hardly useful.

Not having heard any ideas that sounded better than semi-newline-newline,
I went ahead and finished up this patch on that basis.  Attached are two
patch files; the first one redefines the behavior of -j, and the second
one modifies initdb to use only one standalone-backend run.  I present
them this way to emphasize that the -j change doesn't break much of
anything: initdb still works if you apply only the first patch.  And
I didn't change anything in the initdb input files, except for adding
the comment documentation Mark suggests above.

In passing in the first patch, I got rid of the TCOP_DONTUSENEWLINE
#define, which could not have been used by anyone in a very long time
because it would break initdb.  I then realized that
src/include/tcop/tcopdebug.h is completely dead code, because the
other debugging symbol it claims to specify got ripped out long ago.
And to add insult to injury, that file is included nowhere.  I imagine
Bruce's pgrminclude tool got rid of the inclusion that must once have
existed in postgres.c, after observing that postgres.c still compiled
without it :-(.  (That tool really requires more adult supervision
than it has gotten.) So anyway, this patch removes tcopdebug.h entirely.

The second patch consists of removing extra backend starts/stops
and converting all of initdb's code to run in -j mode, rather than the
mishmash of -j and not-j behavior that was there before.  I changed
all the semicolon-newlines in initdb's command strings to
semicolon-newline-newlines.  As mentioned before, only a small number of
those changes *had* to be made to get it to work, namely the ones around
VACUUM and CREATE DATABASE statements, but I felt that for consistency
and error localization all of them should be changed.  I also failed to
resist the temptation to const-ify some of the arrays more thoroughly.

Barring objections I'll push this into HEAD soon.

regards, tom lane

diff --git a/doc/src/sgml/ref/postgres-ref.sgml b/doc/src/sgml/ref/postgres-ref.sgml
index e2e9909..d60d4ff 100644
*** a/doc/src/sgml/ref/postgres-ref.sgml
--- b/doc/src/sgml/ref/postgres-ref.sgml
*** PostgreSQL documentation
*** 529,535 ****
  
  
  
!  The following options only apply to the single-user mode.
  
  
  
--- 529,537 ----
  
  
  
!  The following options only apply to the single-user mode
!  (see ).
  
  
  
*** PostgreSQL documentation
*** 558,564 ****
-E

 
! Echo all commands.
 

   
--- 560,566 ----
-E

 
! Echo all commands to standard output before executing them.
 

   
*** PostgreSQL documentation
*** 567,573 ****
-j

 
! Disables use of newline as a statement delimiter.
 

   
--- 569,576 ----
-j

 
! Use semicolon followed by two newlines, rather than just newline,
! as the command entry terminator.
 

   
*** PostgreSQL documentation
*** 760,767 ****

   
  
!  
!   Usage
  
 
  To start a single-user mode server, use a command like
--- 763,770 ----

   
  
!  
!   Single-User Mode
  
 
  To start a single-user mode server, use a command like
*** PostgreSQL documentation
*** 778,807 ****
  entry terminator; there is no intelligence about semicolons,
  as there is in psql.  To continue a command
  across multiple lines, you must type backslash just before each
! newline except the last one.
 
  
 
! But if you use the -j command line switch, then newline does
! not terminate command entry.  In this case, the server will read the standard input
! until the end-of-file (EOF) marker, then
! process the input as a single command string.  Backslash-newline is not
! treated specially in this case.
 
  
 
  To quit the session, type EOF
  (ControlD, usually).
! If you've
! used -j, two consecutive EOFs are needed to exit.
 
  
 
  Note that the single-user mode server does not provide sophisticated
  line-editing features (no command history, for 

Re: Using a single standalone-backend run in initdb (was Re: [HACKERS] Bootstrap DATA is a pita)

2015-12-13 Thread Craig Ringer
On 13 December 2015 at 06:31, Tom Lane  wrote:


> I'm not particularly wedded to this rule.  In principle we could go so
> far as to import psql's code that parses commands and figures out which
> semicolons are command terminators --- but that is a pretty large chunk
> of code, and I think it'd really be overkill considering that initdb
> deals only with fixed input scripts.


Shouldn't that be a bison/flex job anyway, rather than hand-coded? Or a
simple(ish) state machine?

Dollar-quoted strings are probably the only quite ugly bit due to their
arbitrary delimiters. So I thought I'd sketch out how it'd look as a state
machine. At which point I remembered that we allow $ in identifiers too. So
the state machine would have to bother with unquoted identifiers. Of course
$ marks parameters, so it has to keep track of whether it's reading a parameter.
At which point you have half an SQL parser.

This strikes me as a really good reason for making it re-usable, because
it's horrid to write code that handles statement splitting in the
PostgreSQL dialect.

Optional handling of psql \commands would be required, but that'd make it
easier for PgAdmin to support psql backslash commands, so there'd be a win
there too.

I figured I'd sketch it out for kicks. Comment: yuck.

States would be at least:

SQL_TEXT
SQL_UNQUOTED_IDENTIFIER_OR_KEYWORD
NUMBER
QUOTED_IDENTIFIER
QUOTED_IDENTIFIER_QUOTE
SQL_TEXT_DOLLAR
DOLLAR_STRING_START_DELIM
DOLLAR_STRING
DOLLAR_STRING_DOLLAR
DOLLAR_STRING_END_DELIM
STANDARD_STRING
STANDARD_STRING_QUOTE
SQL_TEXT_E
ESCAPE_STRING
ESCAPE_STRING_ESCAPE

Transitions

SQL_TEXT => { SQL_TEXT, SQL_UNQUOTED_IDENTIFIER_OR_KEYWORD, NUMBER, QUOTED_IDENTIFIER, SQL_TEXT_DOLLAR, STANDARD_STRING, SQL_TEXT_E, ESCAPE_STRING }

SQL_TEXT_E => { SQL_TEXT, ESCAPE_STRING, SQL_UNQUOTED_IDENTIFIER_OR_KEYWORD }

SQL_TEXT_DOLLAR => { SQL_TEXT, NUMBER, DOLLAR_STRING_START_DELIM }

QUOTED_IDENTIFIER => { QUOTED_IDENTIFIER, QUOTED_IDENTIFIER_QUOTE }

QUOTED_IDENTIFIER_QUOTE => { SQL_TEXT, QUOTED_IDENTIFIER }

DOLLAR_STRING_START_DELIM => { DOLLAR_STRING_START_DELIM, DOLLAR_STRING }

DOLLAR_STRING => { DOLLAR_STRING, DOLLAR_STRING_DOLLAR }

DOLLAR_STRING_END_DELIM => { DOLLAR_STRING_END_DELIM, SQL_TEXT, DOLLAR_STRING }

STANDARD_STRING => { STANDARD_STRING, STANDARD_STRING_QUOTE }

STANDARD_STRING_QUOTE => { SQL_TEXT, STANDARD_STRING }

ESCAPE_STRING => { ESCAPE_STRING, ESCAPE_STRING_ESCAPE }

ESCAPE_STRING_ESCAPE => { SQL_TEXT, ESCAPE_STRING }


NUMBER consumes sequential digits and period chars and returns to SQL_TEXT
at any non-digit. (That way it can handle Pg's lazy parser quirks like
SELECT 123"f" being legal, and equivalent to SELECT 123 AS "f").

SQL_UNQUOTED_IDENTIFIER_OR_KEYWORD is needed because a $ within an
identifier is part of the identifier, so it can't just be consumed as
SQL_TEXT.

For dollar strings, when a $ is found when reading SQL text (not an
identifier/keyword), enter SQL_TEXT_DOLLAR. What comes next must be a
parameter or the start of a dollar string. If the next char is a digit then
it's a parameter so switch to NUMBER, since dollar-quoted string delims
follow identifier rules and unquoted identifiers can't start with a number.
Otherwise switch to DOLLAR_STRING_START_DELIM and consume until a $ is
found or something illegal in an identifier is found. Or of course it could
be a lone $, which is bogus syntax but as far as we're concerned still just
SQL_TEXT. Really, this is just looking for a dollar-quote start and doesn't
care what it finds if it isn't a valid dollar-quote start.

If a valid dollar-quote delim is found switch to DOLLAR_STRING and read
until we find the matching delim using a similar process, entering
DOLLAR_STRING_DOLLAR, looking for param vs end delim, etc. When a full
delimiter is read compare to the start delimiter and switch back to
SQL_TEXT mode if it matches, otherwise remain in DOLLAR_STRING.

If an invalid dollar string delim was found switch back to SQL_TEXT (since
it wasn't a valid beginning of a dollar string) and continue.

For QUOTED_IDENTIFIER_QUOTE and STANDARD_STRING_QUOTE, it found a " or '
while reading a quoted identifier or standard string and transitioned into
that state. If the next char doubles the quote it'll return to reading the
string; otherwise it'll return to the SQL_TEXT state since the identifier
or literal has ended.

Similarly with ESCAPE_STRING_ESCAPE. Having found an escape, consume the
next char even if it's a quote and return to the string parsing.

All this ignores psql backslash commands.
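
For concreteness, a deliberately stripped-down sketch of the splitting idea
in C. It tracks only standard strings, quoted identifiers, and dollar
quotes, and omits escape strings, comments, $ in identifiers, and backslash
commands entirely, so treat it as a sketch of the shape of the thing rather
than a usable splitter:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

enum state { SQL_TEXT, STANDARD_STRING, QUOTED_IDENTIFIER, DOLLAR_STRING };

/* Length of a valid dollar-quote delimiter ("$tag$") starting at p, else 0. */
static size_t dollar_delim(const char *p)
{
    const char *q = p + 1;

    if (*p != '$' || isdigit((unsigned char) *q))
        return 0;               /* $1, $2, ... are parameters, not delimiters */
    while (isalnum((unsigned char) *q) || *q == '_')
        q++;
    return (*q == '$') ? (size_t) (q - p + 1) : 0;
}

static void split(const char *sql)
{
    enum state  st = SQL_TEXT;
    char        delim[64];
    size_t      dlen = 0;
    const char *start = sql;
    const char *p;

    for (p = sql; *p; p++)
    {
        if (st == SQL_TEXT)
        {
            if (*p == '\'')
                st = STANDARD_STRING;
            else if (*p == '"')
                st = QUOTED_IDENTIFIER;
            else if (*p == '$' && (dlen = dollar_delim(p)) > 0 &&
                     dlen < sizeof(delim))
            {
                memcpy(delim, p, dlen);     /* remember the opening delimiter */
                st = DOLLAR_STRING;
                p += dlen - 1;
            }
            else if (*p == ';')
            {
                /* top-level semicolon: emit one statement */
                printf("statement: %.*s\n", (int) (p - start + 1), start);
                start = p + 1;
            }
        }
        else if (st == STANDARD_STRING && *p == '\'')
        {
            if (p[1] == '\'')
                p++;            /* doubled '' stays inside the literal */
            else
                st = SQL_TEXT;
        }
        else if (st == QUOTED_IDENTIFIER && *p == '"')
        {
            if (p[1] == '"')
                p++;            /* doubled "" stays inside the identifier */
            else
                st = SQL_TEXT;
        }
        else if (st == DOLLAR_STRING && *p == '$' &&
                 strncmp(p, delim, dlen) == 0)
        {
            st = SQL_TEXT;      /* matching closing delimiter found */
            p += dlen - 1;
        }
    }
}

int main(void)
{
    split("SELECT ';'; CREATE FUNCTION f() RETURNS text"
          " AS $f$ not; a split $f$ LANGUAGE sql;");
    return 0;
}

Even at this reduced scope the dollar-quote bookkeeping is most of the
code, which rather supports the "yuck" verdict.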

Have I missed anything really obvious? Does it seem useful to have more
re-usable statement splitting code? Is there any sane justification for
doing anything but extracting what psql does verbatim while carefully
changing nothing?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Using a single standalone-backend run in initdb (was Re: [HACKERS] Bootstrap DATA is a pita)

2015-12-13 Thread Tom Lane
Craig Ringer  writes:
> On 13 December 2015 at 06:31, Tom Lane  wrote:
>> I'm not particularly wedded to this rule.  In principle we could go so
>> far as to import psql's code that parses commands and figures out which
>> semicolons are command terminators --- but that is a pretty large chunk
>> of code, and I think it'd really be overkill considering that initdb
>> deals only with fixed input scripts.

> Shouldn't that be a bison/flex job anyway, rather than hand-coded?

It is, if you're speaking of how psql does it.

I thought about trying to get the backend's existing lexer to do it,
but that code will want to throw an error if it sees unterminated
input (such as an incomplete slash-star comment).  I'm not sure that
it'd be a good thing to try to make that lexer serve two masters.

I'm also getting less and less enthused about trying to share code with
psql.  In the first place, the backend has no interest in recognizing
psql backslash-commands, nor does it need to deal with some of the weird
constraints psql has like having to handle non-backend-safe encodings.
In the second, while it's reasonable for psql to deal with CREATE RULE
syntax by counting parentheses, there's a good argument that that is
not the behavior we want for noninteractive situations such as reading
information_schema.sql.  We won't, for example, have anything
corresponding to psql's changing input prompt to help debug problems.
In the third place, it's just difficult and ugly to write code that
will work in both backend and frontend contexts.  We've done it,
certainly, but not for any problem as involved as this would be.

regards, tom lane




Re: Using a single standalone-backend run in initdb (was Re: [HACKERS] Bootstrap DATA is a pita)

2015-12-13 Thread Mark Dilger

> On Dec 12, 2015, at 9:40 PM, Tom Lane  wrote:
> 
> Mark Dilger  writes:
>>> On Dec 12, 2015, at 3:42 PM, Tom Lane  wrote:
>>> ... In general, though, I'd rather not try to
>>> teach InteractiveBackend() such a large amount about SQL syntax.
> 
>> I use CREATE RULE within startup files in the fork that I maintain.  I have
>> lots of them, totaling perhaps 50k lines of rule code.  I don't think any of
>> that code would have a problem with the double-newline separation you propose,
>> which seems a more elegant solution to me.
> 
> Yeah?  Just for proof-of-concept, could you run your startup files with
> the postgres.c patch as proposed, and see whether you get any failures?

Given all the changes I've made to initdb.c in my fork, that patch
of yours doesn't apply.

mark 



Re: Using a single standalone-backend run in initdb (was Re: [HACKERS] Bootstrap DATA is a pita)

2015-12-12 Thread Tom Lane
Mark Dilger  writes:
>> On Dec 12, 2015, at 3:42 PM, Tom Lane  wrote:
>> ... In general, though, I'd rather not try to
>> teach InteractiveBackend() such a large amount about SQL syntax.

> I use CREATE RULE within startup files in the fork that I maintain.  I have
> lots of them, totaling perhaps 50k lines of rule code.  I don't think any of
> that code would have a problem with the double-newline separation you propose,
> which seems a more elegant solution to me.

Yeah?  Just for proof-of-concept, could you run your startup files with
the postgres.c patch as proposed, and see whether you get any failures?

> Admittedly, the double-newline
> separation would need to be documented at the top of each sql file, otherwise
> it would be quite surprising to those unfamiliar with it.

Agreed, that wouldn't be a bad thing.

I thought of a positive argument not to do the "fully right" thing by
means of implementing the exactly-right command boundary rules.  Suppose
that you mess up in information_schema.sql or another large input file
by introducing an extra left parenthesis in some query.  What would happen
if InteractiveBackend() were cognizant of the paren-matching rule is that
it would slurp everything till the end-of-file and then produce a syntax
error message quoting all that text; not much better than what happens
today.  With a command break rule like semi-newline-newline, there'll be
a limited horizon as to how much text gets swallowed before you get the
error message.

Note that this is different from the situation with a fully interactive
input processor like psql: if you're typing the same thing in psql,
you'll realize as soon as it doesn't execute the command when-expected
that something is wrong.  You won't type another thousand lines of input
before looking closely at what you typed already.

I'm still not quite sold on semi-newline-newline as being the best
possible command boundary rule here; but I do think that "fully correct"
boundary rules are less attractive than they might sound.

regards, tom lane




Re: [HACKERS] Bootstrap DATA is a pita

2015-12-12 Thread Tom Lane
Mark Dilger  writes:
>> On Dec 11, 2015, at 2:54 PM, Caleb Welton  wrote:
>> Compare:
>> CREATE FUNCTION lo_export(oid, text) RETURNS integer LANGUAGE internal 
>> STRICT AS 'lo_export' WITH (OID=765);  
>> 
>> DATA(insert OID = 765 (  lo_export  PGNSP PGUID 12 1 0 0 0 f f f f t f v u 2 0 23 "26 25" _null_ _null_ _null_ _null_ _null_ lo_export _null_ _null_ _null_ ));

> I would like to hear more about this idea.  Are you proposing that we use
> something like the above CREATE FUNCTION format to express what is
> currently being expressed with DATA statements?

Yes, that sort of idea has been kicked around some already, see the
archives.

> That is an interesting idea, though I don't know what exactly that would
> look like.  If you want to forward this idea, I'd be eager to hear your
> thoughts.  If not, I'll try to make progress with my idea of tab-delimited
> files and such (or really, Alvaro's idea of CSV files that I only slightly
> corrupted).

Personally I would like to see both approaches explored.  Installing as
much as we can via SQL commands is attractive for a number of reasons;
but there is going to be an irreducible minimum amount of stuff that
has to be inserted by something close to the current bootstrapping
process.  (And I'm not convinced that that "minimum amount" is going
to be very small...)  So it's not impossible that we'd end up accepting
*both* types of patches, one to do more in the post-bootstrap SQL world
and one to make the bootstrap data notation less cumbersome.  In any
case it would be useful to push both approaches forward some more before
we make any decisions between them.

BTW, there's another thing I'd like to see improved in this area, which is
a problem already but will get a lot worse if we push more work into the
post-bootstrap phase of initdb.  That is that the post-bootstrap phase is
both inefficient and impossible to debug.  If you've ever had a failure
there, you'll have seen that the backend spits out an entire SQL script
and says there's an error in it somewhere; that's because it gets the
whole per-stage script as one submission.  (Try introducing a syntax error
somewhere in information_schema.sql, and you'll see what I mean.)
Breaking the stage scripts down further would help, but that is
unattractive because each one requires a fresh backend startup/shutdown,
including a full checkpoint.  I'd like to see things rejiggered so that
there's only one post-bootstrap standalone backend session that performs
all the steps, but initdb feeds it just one SQL command at a time so that
errors are better localized.  That should both speed up initdb noticeably
and make debugging easier.

regards, tom lane




Re: Using a single standalone-backend run in initdb (was Re: [HACKERS] Bootstrap DATA is a pita)

2015-12-12 Thread Mark Dilger

> On Dec 12, 2015, at 3:42 PM, Tom Lane  wrote:
> 
> Joe Conway  writes:
>> On 12/12/2015 02:31 PM, Tom Lane wrote:
>>> I'm not particularly wedded to this rule.  In principle we could go so
>>> far as to import psql's code that parses commands and figures out which
>>> semicolons are command terminators --- but that is a pretty large chunk
>>> of code, and I think it'd really be overkill considering that initdb
>>> deals only with fixed input scripts.  But if anyone has another simple
>>> rule for breaking SQL into commands, we can certainly discuss
>>> alternatives.
> 
>> Possibly inadequate, but I wrote a get_one_query() function to grab one
>> statement at a time from a possibly multi-statement string and it isn't
>> all that many lines of code:
>>  https://github.com/jconway/pgsynck/blob/master/pgsynck.c
> 
> Hmm.  Doesn't look like that handles semicolons embedded in CREATE RULE;
> for that you'd have to track parenthesis nesting as well.  (It's arguable
> that we won't ever need that case during initdb, but I'd just as soon not
> wire in such an assumption.)  In general, though, I'd rather not try to
> teach InteractiveBackend() such a large amount about SQL syntax.

I use CREATE RULE within startup files in the fork that I maintain.  I have
lots of them, totaling perhaps 50k lines of rule code.  I don't think any of
that code would have a problem with the double-newline separation you propose,
which seems a more elegant solution to me.  Admittedly, the double-newline
separation would need to be documented at the top of each sql file, otherwise
it would be quite surprising to those unfamiliar with it.

You mentioned upthread that introducing a syntax error in one of these files
results in a not-so-helpful error message that dumps the contents of the
entire file.  I can confirm that this happens, and it is hardly useful.

mark





Re: Using a single standalone-backend run in initdb (was Re: [HACKERS] Bootstrap DATA is a pita)

2015-12-12 Thread Tom Lane
Andres Freund  writes:
> That's cool too. Besides processing the .bki files (and there it's largely
> reg*_in), the many restarts are the most expensive parts of initdb.

BTW, in case anyone is doubting it, I did a little bit of "perf" tracing
and confirmed Andres' comment here: more than 50% of the runtime of the
bootstrap phase is eaten by the pg_proc seqscans performed by regprocin.
There's nothing else amounting to more than a few percent, so basically
nothing else in bootstrap is worth optimizing before we fix that.

regards, tom lane




Re: Using a single standalone-backend run in initdb (was Re: [HACKERS] Bootstrap DATA is a pita)

2015-12-12 Thread Joe Conway
On 12/12/2015 02:31 PM, Tom Lane wrote:
> I'm not particularly wedded to this rule.  In principle we could go so
> far as to import psql's code that parses commands and figures out which
> semicolons are command terminators --- but that is a pretty large chunk
> of code, and I think it'd really be overkill considering that initdb
> deals only with fixed input scripts.  But if anyone has another simple
> rule for breaking SQL into commands, we can certainly discuss
> alternatives.

Possibly inadequate, but I wrote a get_one_query() function to grab one
statement at a time from a possibly multi-statement string and it isn't
all that many lines of code:

  https://github.com/jconway/pgsynck/blob/master/pgsynck.c

> Anyway, the attached patch tweaks postgres.c to follow that rule instead
> of slurp-to-EOF when -j is given.  I doubt that being non-backwards-
> compatible is a problem here; in fact, I'm tempted to rip out the -j
> switch altogether and just have standalone mode always parse input the
> same way.  Does anyone know of people using standalone mode other than
> for initdb?

sepgsql uses it for installation, but it does not appear to use -j.
I'm not sure why it is required, but at some point I'd like to dig into that.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Using a single standalone-backend run in initdb (was Re: [HACKERS] Bootstrap DATA is a pita)

2015-12-12 Thread Tom Lane
I wrote:
> BTW, there's another thing I'd like to see improved in this area, which is
> a problem already but will get a lot worse if we push more work into the
> post-bootstrap phase of initdb.  That is that the post-bootstrap phase is
> both inefficient and impossible to debug.  If you've ever had a failure
> there, you'll have seen that the backend spits out an entire SQL script
> and says there's an error in it somewhere; that's because it gets the
> whole per-stage script as one submission.  (Try introducing a syntax error
> somewhere in information_schema.sql, and you'll see what I mean.)
> Breaking the stage scripts down further would help, but that is
> unattractive because each one requires a fresh backend startup/shutdown,
> including a full checkpoint.  I'd like to see things rejiggered so that
> there's only one post-bootstrap standalone backend session that performs
> all the steps, but initdb feeds it just one SQL command at a time so that
> errors are better localized.  That should both speed up initdb noticeably
> and make debugging easier.

I thought this sounded like a nice lazy-Saturday project, so I started
poking at it, and attached is a WIP patch.  The core issue that has to
be dealt with is that standalone-backend mode currently has just two
rules for deciding when to stop collecting input and execute the command
buffer, and they both suck:

1. By default, execute after every newline.  (Actually, you can quote
a newline with a backslash, but we don't use that ability anywhere.)

2. With -j, slurp the entire input until EOF, and execute it as one
giant multicommand string.

We're doing #2 to handle information_schema.sql and the other large
SQL scripts that initdb runs, which is why the response to an error in
those scripts is so yucky.

After some experimentation, I came up with the idea of executing any
time that a semicolon followed by two newlines is seen.  This nicely
breaks up input like information_schema.sql.  There are probably some
residual places where more than one command is executed in a single
string, but we could fix that with some more newlines.  Obviously,
this rule is capable of being fooled if you have a newline followed by
a blank line in a comment or quoted literal --- but it turns out that
no such case exists anywhere in initdb's data.
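
For illustration only, a minimal C sketch of what that boundary rule
amounts to: collect stdin into a buffer and execute whenever the buffer
ends in semicolon-newline-newline. This is not the patch itself, and
execute_command() is a hypothetical stand-in for handing the buffer to
the executor:

#include <stdio.h>
#include <string.h>

/* hypothetical stand-in for executing the collected command string */
static void execute_command(const char *buf)
{
    printf("executing: %s\n", buf);
}

int main(void)
{
    char    buf[8192];
    int     len = 0;
    int     c;

    while ((c = getchar()) != EOF)
    {
        if (len < (int) sizeof(buf) - 1)
            buf[len++] = (char) c;

        /* command boundary: the buffer now ends in ";\n\n" */
        if (len >= 3 && memcmp(buf + len - 3, ";\n\n", 3) == 0)
        {
            buf[len] = '\0';
            execute_command(buf);
            len = 0;
        }
    }
    if (len > 0)                /* run whatever remains at EOF */
    {
        buf[len] = '\0';
        execute_command(buf);
    }
    return 0;
}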

I'm not particularly wedded to this rule.  In principle we could go so
far as to import psql's code that parses commands and figures out which
semicolons are command terminators --- but that is a pretty large chunk
of code, and I think it'd really be overkill considering that initdb
deals only with fixed input scripts.  But if anyone has another simple
rule for breaking SQL into commands, we can certainly discuss
alternatives.

Anyway, the attached patch tweaks postgres.c to follow that rule instead
of slurp-to-EOF when -j is given.  I doubt that being non-backwards-
compatible is a problem here; in fact, I'm tempted to rip out the -j
switch altogether and just have standalone mode always parse input the
same way.  Does anyone know of people using standalone mode other than
for initdb?

The other part of the patch modifies initdb to do all its post-bootstrap
steps using a single standalone backend session.  I had to remove the
code that currently prints out progress markers for individual phases
of that processing, so that now you get output that looks like

creating directory /home/postgres/testversion/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
creating template1 database in /home/postgres/testversion/data/base/1 ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

Since initdb is just printing to the backend in an open-loop fashion,
it doesn't really know whether each command succeeds; in fact it is
usually pretty far ahead of the backend, until it does pclose() which
waits for the subprocess to exit.  So we can't readily keep the old
progress markers.  I don't think we'll miss them though.  The whole
"post-bootstrap initialization" step only takes a second or so on my
main dev machine, so breaking down the progress more finely isn't that
useful anymore.
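
For illustration, the open-loop shape is roughly the following sketch, with
"cat" standing in for the "postgres --single" child and hypothetical
command strings; there is no per-command feedback, only the exit status at
pclose():

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* "cat" stands in for the single standalone backend */
    FILE   *pg = popen("cat > /dev/null", "w");

    if (pg == NULL)
        return EXIT_FAILURE;

    /* each command terminated by ";\n\n", per the new -j rule */
    fputs("CREATE VIEW v AS SELECT 1;\n\n", pg);
    fputs("VACUUM FREEZE;\n\n", pg);

    /* failure only becomes visible here, when the child exits */
    if (pclose(pg) != 0)
    {
        fprintf(stderr, "post-bootstrap initialization failed\n");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}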

I had to change ;\n to ;\n\n in a few places in initdb's internal
scripts, to ensure that VACUUM commands were executed by themselves
(otherwise you get "VACUUM can't run in a transaction block" type
failures).  I wasn't very thorough about that though, pending a
decision on exactly what the new command-boundary rule will be.

The upshot of these changes is that initdb runs about 10% faster overall
(more in -N mode), which is a useful savings.  Also, the response to a
syntax error in information_schema.sql now looks like this:

creating template1 database in /home/postgres/testversion/data/base/1 ... ok
performing post-bootstrap initialization ... FATAL:  column 

Re: [HACKERS] Bootstrap DATA is a pita

2015-12-12 Thread Andres Freund
On 2015-12-12 13:28:28 -0500, Tom Lane wrote:
> BTW, there's another thing I'd like to see improved in this area, which is
> a problem already but will get a lot worse if we push more work into the
> post-bootstrap phase of initdb.  That is that the post-bootstrap phase is
> both inefficient and impossible to debug.  If you've ever had a failure
> there, you'll have seen that the backend spits out an entire SQL script
> and says there's an error in it somewhere; that's because it gets the
> whole per-stage script as one submission.

Seen that more than once :(

> Breaking the stage scripts down further would help, but that is
> unattractive because each one requires a fresh backend startup/shutdown,
> including a full checkpoint.  I'd like to see things rejiggered so that
> there's only one post-bootstrap standalone backend session that performs
> all the steps, but initdb feeds it just one SQL command at a time so that
> errors are better localized.  That should both speed up initdb noticeably
> and make debugging easier.

One way to do that would be to not use the single user mode for that
stage. Afair, at that point we could actually just start the cluster
"normally", with the socket pointing somewhere locally (ugh, some path
length issues afoot, maybe allow relative directories? And, uh,
windows), and use psql to do the splitting and everything for us.

Your approach probably has some significant performance benefits,
because it essentially does pipelining. So while aesthetically
attractive, I'm afraid my proposal would lead to worse performance. So
it's probably actually DOA :(

Andres




Re: Using a single standalone-backend run in initdb (was Re: [HACKERS] Bootstrap DATA is a pita)

2015-12-12 Thread Andres Freund
On 2015-12-12 17:31:49 -0500, Tom Lane wrote:
> I thought this sounded like a nice lazy-Saturday project, so I started
> poking at it, and attached is a WIP patch.

Not bad, not bad at all.


> After some experimentation, I came up with the idea of executing any
> time that a semicolon followed by two newlines is seen.  ...
> but it turns out that no such case exists anywhere in initdb's data.

Not pretty, but hardly any worse than the current situation.


> I'm not particularly wedded to this rule.  In principle we could go so
> far as to import psql's code that parses commands and figures out which
> semicolons are command terminators --- but that is a pretty large chunk
> of code, and I think it'd really be overkill considering that initdb
> deals only with fixed input scripts.

Having that code somewhere abstracted wouldn't be bad though, extension
scripts have a somewhat similar problem.


> Anyway, the attached patch tweaks postgres.c to follow that rule instead
> of slurp-to-EOF when -j is given.  I doubt that being non-backwards-
> compatible is a problem here; in fact, I'm tempted to rip out the -j
> switch altogether and just have standalone mode always parse input the
> same way.

No objection here.


> Does anyone know of people using standalone mode other than
> for initdb?

Unfortunately yes. There are Docker instances around that configure users
and everything using it.


> The other part of the patch modifies initdb to do all its post-bootstrap
> steps using a single standalone backend session.  I had to remove the
> code that currently prints out progress markers for individual phases
> of that processing, so that now you get output that looks like

That's cool too. Besides processing the .bki files (and there it's largely
reg*_in), the many restarts are the most expensive parts of initdb.


Greetings,

Andres Freund




Re: Using a single standalone-backend run in initdb (was Re: [HACKERS] Bootstrap DATA is a pita)

2015-12-12 Thread Tom Lane
Andres Freund  writes:
> On 2015-12-12 17:31:49 -0500, Tom Lane wrote:
>> Does anyone know of people using standalone mode other than
>> for initdb?

> Unfortunately yes. There are Docker instances around that configure users
> and everything using it.

Hm, that means that we *do* have to worry about backwards compatibility.
We might be able to get away with changing the behavior of -j mode anyway,
though, since this proposal mostly only changes when execution happens
and not what is valid input for -j mode.  (It would probably break some
apps if we took away the switch, since right now, you do not need a
semicolon to terminate commands in the regular standalone mode.)
Failing that, we could define a new switch, I guess.

> That's cool too. Besides processing the .bki files (and there it's largely
> reg*_in), the many restarts are the most expensive parts of initdb.

Right.  The proposal we were discussing upthread would move all the reg*
lookups into creation of the .bki file, basically, which would improve
that part of things quite a bit.  (BTW, if we are concerned about initdb
speed, that might be a reason not to be too eager to shift processing
from bootstrap to non-bootstrap mode.  Other than the reg* issue,
bootstrap is certainly a far faster way to put rows into the catalogs
than individual SQL commands could be.)

regards, tom lane




Re: Using a single standalone-backend run in initdb (was Re: [HACKERS] Bootstrap DATA is a pita)

2015-12-12 Thread Tom Lane
Joe Conway  writes:
> On 12/12/2015 02:31 PM, Tom Lane wrote:
>> I'm not particularly wedded to this rule.  In principle we could go so
>> far as to import psql's code that parses commands and figures out which
>> semicolons are command terminators --- but that is a pretty large chunk
>> of code, and I think it'd really be overkill considering that initdb
>> deals only with fixed input scripts.  But if anyone has another simple
>> rule for breaking SQL into commands, we can certainly discuss
>> alternatives.

> Possibly inadequate, but I wrote a get_one_query() function to grab one
> statement at a time from a possibly multi-statement string and it isn't
> all that many lines of code:
>   https://github.com/jconway/pgsynck/blob/master/pgsynck.c

Hmm.  Doesn't look like that handles semicolons embedded in CREATE RULE;
for that you'd have to track parenthesis nesting as well.  (It's arguable
that we won't ever need that case during initdb, but I'd just as soon not
wire in such an assumption.)  In general, though, I'd rather not try to
teach InteractiveBackend() such a large amount about SQL syntax.

With a rule like "break at ;\n\n" it's possible to ensure that command
breaks occur only where wanted, though in corner cases you might have to
format your input oddly.  (For instance, if you needed that in a SQL
literal, you might resort to E';\n\n' or use the standard's rules about
concatenated string literals.)  If you get it wrong the consequences
aren't too disastrous: you'll get an unterminated-input syntax error,
or in the other direction multiple commands will get run together for
execution, which most of the time isn't a big issue.

>> Does anyone know of people using standalone mode other than
>> for initdb?

> sepgsql uses it for installation, but it does not appear to use -j
> I'm not sure why it is required but at some point I'd like to dig into that.

It might be easier than starting a full postmaster and having to figure
out a secure place for the socket etc.  I'm prepared to back off the
proposal about changing the default behavior of standalone mode; that
leaves us with a choice between changing -j's behavior and inventing
a new switch.

regards, tom lane




Re: [HACKERS] Bootstrap DATA is a pita

2015-12-11 Thread Mark Dilger

> On Dec 11, 2015, at 1:46 PM, Tom Lane  wrote:
> 
> Alvaro Herrera  writes:
>> Crazy idea: we could just have a CSV file which can be loaded into a
>> table for mass changes using regular DDL commands, then dumped back from
>> there into the file.  We already know how to do these things, using
>> \copy etc.  Since CSV uses one line per entry, there would be no merge
>> problems either (or rather: all merge problems would become conflicts,
>> which is what we want.)
> 
> That's an interesting proposal.  It would mean that the catalog files
> stay at more or less their current semantic level (direct representations
> of bootstrap catalog contents), but it does sound like a more attractive
> way to perform complex edits than writing Emacs macros ;-).

I would be happy to work on this, if there is much chance of the community
accepting a patch.  Do you think replacing the numeric Oids for functions,
operators, opclasses and such in the source files with their names would
be ok, with the SQL converting those to Oids in the output?  My eyes have
gotten tired more than once trying to read header files in src/include/catalog
looking for mistakes in what largely amounts to a big table of numbers.

For example, in pg_amop.h:

/* default operators int2 */
DATA(insert (   1976   21 21 1 s   95   403 0 ));
DATA(insert (   1976   21 21 2 s   522  403 0 ));
DATA(insert (   1976   21 21 3 s   94   403 0 ));
DATA(insert (   1976   21 21 4 s   524  403 0 ));
DATA(insert (   1976   21 21 5 s   520  403 0 ));

Would become something like:

amopfamily   amoplefttype  amoprighttype  amopstrategy  amoppurpose  amopopr  amopmethod  amopsortfamily
integer_ops  int2          int2           1             search       "<"      btree       0
integer_ops  int2          int2           2             search       "<="     btree       0
integer_ops  int2          int2           3             search       "="      btree       0
integer_ops  int2          int2           4             search       ">="     btree       0
integer_ops  int2          int2           5             search       ">"      btree       0

Note that I prefer to use tabs and a header line, as the tab stop can be set
to line them up nicely, and the header line lets you see which column is
which, and what it is for.  CSV is always harder for me to use that way,
though maybe that is just a matter of which editor I use (vim).

And yes, I'd need to allow the HEADER option for copying tab-delimited
files, since it is currently only allowed for CSV, I believe.

mark








Re: [HACKERS] Bootstrap DATA is a pita

2015-12-11 Thread Tom Lane
Mark Dilger  writes:
>> On Dec 11, 2015, at 1:46 PM, Tom Lane  wrote:
>> That's an interesting proposal.  It would mean that the catalog files
>> stay at more or less their current semantic level (direct representations
>> of bootstrap catalog contents), but it does sound like a more attractive
>> way to perform complex edits than writing Emacs macros ;-).

> I would be happy to work on this, if there is much chance of the community
> accepting a patch.  Do you think replacing the numeric Oids for functions,
> operators, opclasses and such in the source files with their names would
> be ok, with the SQL converting those to Oids in the output?

Huh?  Those files are the definition of that mapping, no?  Isn't what
you're proposing circular?

regards, tom lane




Re: [HACKERS] Bootstrap DATA is a pita

2015-12-11 Thread Caleb Welton
I took a look at a few of the most recent bulk edit cases for pg_proc.h:

There were two this year:
* The addition of proparallel  [1]
* The addition of protransform [2]

And prior to that the most recent seems to be from 2012:
* The addition of proleakproof [3]

Quick TLDR - these changes are super simple to reflect when generating
SQL for CREATE FUNCTION statements.

Attached is the SQL that would generate function definitions prior to
proleakproof and the diffs that would be required after adding support for
proleakproof, protransform and proparallel.

Each of the diffs indicates the changes that would be needed after the new
column is added; the question of how to populate default values for the new
columns is beyond what can easily be expressed in general terms and depends
entirely on the nature of the new column.

Note: Currently I have focused on the 'pure' functions, e.g. not the
drivers of type serialization, language validation, operators, or other
object types.  I would want to deal with each of those while handling the
conversion for each of those object types in turn.  Additional
modifications would likely be needed for other types of functions.


[1]
https://github.com/postgres/postgres/commit/7aea8e4f2daa4b39ca9d1309a0c4aadb0f7ed81b
[2]
https://github.com/postgres/postgres/commit/8f9fe6edce358f7904e0db119416b4d1080a83aa
[3]
https://github.com/postgres/postgres/commit/cd30728fb2ed7c367d545fc14ab850b5fa2a4850


On Fri, Dec 11, 2015 at 12:55 PM, Caleb Welton  wrote:

> Makes sense.
>
> During my own prototyping what I did was generate the SQL statements via
> SQL querying the existing catalog.  Way easier than hand-writing 1000+
> function definitions and not difficult to modify for future changes.  As
> affirmed, it was very easy to adapt my existing SQL to account for some
> of the newer features in master.
>
> The biggest challenge was establishing a sort order that ensures both a
> unique ordering and that the dependencies needed for SQL functions have
> been processed before trying to define them.  Which effects about 4/1000
> functions based on a natural oid ordering.
>
> > On Dec 11, 2015, at 11:43 AM, Alvaro Herrera 
> wrote:
> >
> > Caleb Welton wrote:
> >> I'm happy working these ideas forward if there is interest.
> >>
> >> Basic design proposal is:
> >>  - keep a minimal amount of bootstrap to avoid intrusive changes to core
> >> components
> >>  - Add capabilities of creating objects with specific OIDs via DDL
> >> during initdb
> >>  - Update the caching/resolution mechanism for builtin functions to be
> >> more dynamic.
> >>  - Move as much of bootstrap as possible into SQL files and create
> >> catalog via DDL
> >
> > I think the point we got stuck last time at was deciding on a good
> > format for the data coming from the DATA lines.  One of the objections
> > raised for formats such as JSON is that it's trivial for "git merge" (or
> > similar tools) to make a mistake because object-end/object-start lines
> > are all identical.  And as for the SQL-format version, the objection was
> > that it's hard to modify the lines en-masse when modifying the catalog
> > definition (new column, etc).  Ideally we would like a format that can
> > be bulk-edited without too much trouble.
> >
> > A SQL file would presumably not have the merge issue, but mass-editing
> > would be a pain.
> >
> > Crazy idea: we could just have a CSV file which can be loaded into a
> > table for mass changes using regular DDL commands, then dumped back from
> > there into the file.  We already know how to do these things, using
> > \copy etc.  Since CSV uses one line per entry, there would be no merge
> > problems either (or rather: all merge problems would become conflicts,
> > which is what we want.)
> >
> > --
> > Álvaro Herrera                http://www.2ndQuadrant.com/
> > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
*** gen_protransform.sql2015-12-11 14:36:25.0 -0800
--- gen_proparallel.sql 2015-12-11 14:35:41.0 -0800
***
*** 14,19 
--- 14,23 
ELSE '' END 
|| CASE WHEN proisstrict THEN ' STRICT' ELSE '' END 
|| CASE WHEN proleakproof THEN ' LEAKPROOF' ELSE '' END
+   || CASE proparallel WHEN 's' THEN ' PARALLEL SAFE'
+   WHEN 'r' THEN ' PARALLEL RESTRICTED'
+   WHEN 'u' THEN '' -- PARALLEL UNSAFE is DEFAULT
+   ELSE '' END
|| CASE WHEN (procost != 1   and lanname = 'internal') OR
 (procost != 100 and lanname = 'sql')
THEN ' COST ' 


gen_start.sql
Description: Binary data
*** gen_leakproof.sql   2015-12-11 14:36:09.0 -0800
--- gen_protransform.sql2015-12-11 14:36:25.0 -0800
***
*** 72,77 
  AND prorettype != 'anyenum'::regtype   /* Enum is special */
  AND 'anyenum'::regtype 

Re: [HACKERS] Bootstrap DATA is a pita

2015-12-11 Thread Caleb Welton
The current semantic level is pretty low level, somewhat cumbersome, and
requires filling in values for which, most of the time, the system already
has a pretty good idea of the right defaults.

Compare:

CREATE FUNCTION lo_export(oid, text) RETURNS integer LANGUAGE internal
STRICT AS 'lo_export' WITH (OID=765);
DATA(insert OID = 765 (  lo_export   PGNSP PGUID 12 1 0 0 0 f f f f t f v u 2 0 23 "26 25" _null_ _null_ _null_ _null_ _null_ lo_export _null_ _null_ _null_ ));


In the first one someone has indicated:
   1. a function name,
   2. two parameter type names
   3. a return type
   4. a language
   5. null handling
   6. a symbol
   7. an oid

In the second case 30 separate items have been indicated, and yet both of
them will generate identical end results within the catalog.

The former is largely immune to needing modification even when the
catalog structure changes.
  - adding proleakproof?  No change needed, default value is correct
  - adding protransform?  No change needed, not relevant
  - adding proparallel?  No change needed, default value is correct
  - adding procost? No change needed, default value is correct




On Fri, Dec 11, 2015 at 1:46 PM, Tom Lane  wrote:

> Alvaro Herrera  writes:
> > Crazy idea: we could just have a CSV file which can be loaded into a
> > table for mass changes using regular DDL commands, then dumped back from
> > there into the file.  We already know how to do these things, using
> > \copy etc.  Since CSV uses one line per entry, there would be no merge
> > problems either (or rather: all merge problems would become conflicts,
> > which is what we want.)
>
> That's an interesting proposal.  It would mean that the catalog files
> stay at more or less their current semantic level (direct representations
> of bootstrap catalog contents), but it does sound like a more attractive
> way to perform complex edits than writing Emacs macros ;-).
>
> You could actually do that the hard way right now, with a bit of script
> to convert between DATA lines and CSV format.  But if we anticipate that
> becoming the standard approach, it would definitely make sense to migrate
> the master copies into CSV or traditional COPY format, and teach BKI mode
> to read that (or, perhaps, leave bootstrap.c alone and modify the code
> that produces the .bki file).
>
> This is somewhat orthogonal to the question of whether we want to do
> things like converting noncritical operator-class definitions into
> regular CREATE OPERATOR CLASS syntax.  There's almost certainly going
> to be some hard core of catalog entries that aren't amenable to that,
> and will still need to be loaded from data files of some sort.
>
> regards, tom lane
>


Re: [HACKERS] Bootstrap DATA is a pita

2015-12-11 Thread Mark Dilger

> On Dec 11, 2015, at 2:40 PM, Tom Lane  wrote:
> 
> Mark Dilger  writes:
>>> On Dec 11, 2015, at 1:46 PM, Tom Lane  wrote:
>>> That's an interesting proposal.  It would mean that the catalog files
>>> stay at more or less their current semantic level (direct representations
>>> of bootstrap catalog contents), but it does sound like a more attractive
>>> way to perform complex edits than writing Emacs macros ;-).
> 
>> I would be happy to work on this, if there is much chance of the community
>> accepting a patch.  Do you think replacing the numeric Oids for functions,
>> operators, opclasses and such in the source files with their names would
>> be ok, with the SQL converting those to Oids in the output?
> 
> Huh?  Those files are the definition of that mapping, no?  Isn't what
> you're proposing circular?

No, there are far more references to Oids than there are definitions of them.

For example, the line in pg_operator.h:

DATA(insert OID =  15 ( "="    PGNSP PGUID b t t 23 20 16 416 36 int48eq eqsel eqjoinsel ));

defines 15 as the oid for the equals operator for (int8,int4) returning bool,
but the fact that 23 is the Oid for int4, 20 is the Oid for int8, and 16 is
the Oid for bool is already defined elsewhere (in pg_type.h) and need not be
duplicated here.

I'm just proposing that we don't keep specifying things by number everywhere.
Once you've established the Oid for something (operator, type, function) you
should use the name everywhere else.

mark



Re: [HACKERS] Bootstrap DATA is a pita

2015-12-11 Thread Caleb Welton
Yes, that alone without any other changes would be a marked improvement and
could be implemented in many places, pg_operator is a good example.

... but there is some circularity especially with respect to type
definitions and the functions that define those types.  If you changed the
definition of prorettype into a regtype then bootstrap would try to look up
the type before the pg_type entry exists and throw a fit.  That's handled
in SQL via shell types.  If we wanted bootstrap to be able to handle this
then we'd have to make two passes of pg_type, the first to create the
shells and the second to handle populating the serialization functions.

Unfortunately types and functions tend to be the more volatile areas of the
catalog so this particular circularity is particularly vexing.

On Fri, Dec 11, 2015 at 2:53 PM, Mark Dilger 
wrote:

>
> > On Dec 11, 2015, at 2:40 PM, Tom Lane  wrote:
> >
> > Mark Dilger  writes:
> >>> On Dec 11, 2015, at 1:46 PM, Tom Lane  wrote:
> >>> That's an interesting proposal.  It would mean that the catalog files
> >>> stay at more or less their current semantic level (direct
> representations
> >>> of bootstrap catalog contents), but it does sound like a more
> attractive
> >>> way to perform complex edits than writing Emacs macros ;-).
> >
> >> I would be happy to work on this, if there is much chance of the
> community
> >> accepting a patch.  Do you think replacing the numeric Oids for
> functions,
> >> operators, opclasses and such in the source files with their names would
> >> be ok, with the SQL converting those to Oids in the output?
> >
> > Huh?  Those files are the definition of that mapping, no?  Isn't what
> > you're proposing circular?
>
> No, there are far more references to Oids than there are definitions of
> them.
>
> For example, the line in pg_operator.h:
>
> DATA(insert OID =  15 ( "="    PGNSP PGUID b t t 23 20 16 416 36 int48eq eqsel eqjoinsel ));
>
> defines 15 as the oid for the equals operator for (int8,int4) returning
> bool, but the fact that 23 is the Oid for int4, 20 is the Oid for int8,
> and 16 is the Oid for bool is already defined elsewhere (in pg_type.h)
> and need not be duplicated here.
>
> I'm just proposing that we don't keep specifying things by number
> everywhere.
> Once you've established the Oid for something (operator, type, function)
> you
> should use the name everywhere else.
>
> mark


Re: [HACKERS] Bootstrap DATA is a pita

2015-12-11 Thread Tom Lane
Mark Dilger  writes:
>> On Dec 11, 2015, at 2:40 PM, Tom Lane  wrote:
>> Huh?  Those files are the definition of that mapping, no?  Isn't what
>> you're proposing circular?

> No, there are far more references to Oids than there are definitions of them.

Well, you're still not being very clear, but I *think* what you're
proposing is to put a lot more smarts into the script that converts
the master source files into .bki format.  That is, we might have
"=(int8,int4)" in an entry in the master source file for pg_amop, but
the script would look up that entry using the source data for pg_type
and pg_operator, and then emit a simple numeric OID into the .bki file.
(Presumably, it would know to do this because we'd redefine the
pg_amop.amopopr column as of regoperator type not plain OID.)

Yeah, that could work, though I'd be a bit concerned about the complexity
and speed of the script.  Still, one doesn't usually rebuild postgres.bki
many times a day, so speed might not be a big problem.
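
As a toy illustration of the lookup such a script would do (written in C
here, though the real script would presumably be Perl; the type OIDs below
are the ones quoted elsewhere in this thread, and a real script would build
its map from the pg_type source data rather than hard-coding anything):

#include <stdio.h>
#include <string.h>

struct typmap { const char *name; int oid; };

static const struct typmap typemap[] = {
    { "bool", 16 },
    { "int2", 21 },
    { "int4", 23 },
    { "int8", 20 },
};

static int type_oid(const char *name)
{
    size_t  i;

    for (i = 0; i < sizeof(typemap) / sizeof(typemap[0]); i++)
        if (strcmp(typemap[i].name, name) == 0)
            return typemap[i].oid;
    return -1;                  /* a real script would error out here */
}

int main(void)
{
    /* a master-source entry names the types; the .bki line gets raw OIDs */
    printf("insert ( ... %d %d ... )\n", type_oid("int8"), type_oid("int4"));
    return 0;
}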

This seems more or less orthogonal to the question of whether to get rid
of the DATA() lines in favor of a COPY-friendly data format.  I'd suggest
treating those as separate patches.

regards, tom lane




Re: [HACKERS] Bootstrap DATA is a pita

2015-12-11 Thread Tom Lane
Alvaro Herrera  writes:
> Crazy idea: we could just have a CSV file which can be loaded into a
> table for mass changes using regular DDL commands, then dumped back from
> there into the file.  We already know how to do these things, using
> \copy etc.  Since CSV uses one line per entry, there would be no merge
> problems either (or rather: all merge problems would become conflicts,
> which is what we want.)

That's an interesting proposal.  It would mean that the catalog files
stay at more or less their current semantic level (direct representations
of bootstrap catalog contents), but it does sound like a more attractive
way to perform complex edits than writing Emacs macros ;-).

You could actually do that the hard way right now, with a bit of script
to convert between DATA lines and CSV format.  But if we anticipate that
becoming the standard approach, it would definitely make sense to migrate
the master copies into CSV or traditional COPY format, and teach BKI mode
to read that (or, perhaps, leave bootstrap.c alone and modify the code
that produces the .bki file).

This is somewhat orthogonal to the question of whether we want to do
things like converting noncritical operator-class definitions into
regular CREATE OPERATOR CLASS syntax.  There's almost certainly going
to be some hard core of catalog entries that aren't amenable to that,
and will still need to be loaded from data files of some sort.

regards, tom lane




Re: [HACKERS] Bootstrap DATA is a pita

2015-12-11 Thread Andres Freund
On 2015-12-11 18:12:16 -0500, Tom Lane wrote:
> I think what Mark is proposing is to do the lookups while preparing the
> .bki file, which would eliminate the circularity ... at the cost of having
> to, essentially, reimplement regprocedure_in and friends in Perl.

FWIW, I did that when this came up last. Rather interesting, because it
leads to rather noticeable speedups - currently initdb spends a
significant amount of its time doing regproc lookups. Especially
interesting because at that stage we're largely not using indexes yet, IIRC.


Andres




Re: [HACKERS] Bootstrap DATA is a pita

2015-12-11 Thread Tom Lane
Caleb Welton  writes:
> ... but there is some circularity especially with respect to type
> definitions and the functions that define those types.  If you changed the
> definition of prorettype into a regtype then bootstrap would try to look up
> the type before the pg_type entry exists and throw a fit.  That's handled
> in SQL via shell types.  If we wanted bootstrap to be able to handle this
> then we'd have to make two passes of pg_type, the first to create the
> shells and the second to handle populating the serialization functions.

I think what Mark is proposing is to do the lookups while preparing the
.bki file, which would eliminate the circularity ... at the cost of having
to, essentially, reimplement regprocedure_in and friends in Perl.

If we push hard on doing the other thing that you're proposing, which is
to take as much as possible out of the pure bootstrap-data phase, then
maybe it wouldn't be worth the work to do that.  Not sure.

On the other hand, I'm not very much in love with the thought of having
two different notations for "core" and "not so core" built-in function
creation.  There's something to be said for keeping all the data in one
format.  If we push on making the .bki creation script smarter, then in
addition to the name lookup facilities Mark envisions, we could have
things like default column values.  That would take us a long way toward
the same ease-of-use as full SQL definitions.  We'd still be lacking
some error checks that the SQL commands could perform; but we've
traditionally used sanity checks in the regression tests to do
cross-checking that covers more or less those same bases.

regards, tom lane




Re: [HACKERS] Bootstrap DATA is a pita

2015-12-11 Thread Mark Dilger

> On Dec 11, 2015, at 3:02 PM, Tom Lane  wrote:
> 
> Mark Dilger  writes:
>>> On Dec 11, 2015, at 2:40 PM, Tom Lane  wrote:
>>> Huh?  Those files are the definition of that mapping, no?  Isn't what
>>> you're proposing circular?
> 
>> No, there are far more references to Oids than there are definitions of them.
> 
> Well, you're still not being very clear, but I *think* what you're
> proposing is to put a lot more smarts into the script that converts
> the master source files into .bki format.  That is, we might have
> "=(int8,int4)" in an entry in the master source file for pg_amop, but
> the script would look up that entry using the source data for pg_type
> and pg_operator, and then emit a simple numeric OID into the .bki file.
> (Presumably, it would know to do this because we'd redefine the
> pg_amop.amopopr column as of regoperator type not plain OID.)
> 
> Yeah, that could work, though I'd be a bit concerned about the complexity
> and speed of the script.  Still, one doesn't usually rebuild postgres.bki
> many times a day, so speed might not be a big problem.

I am proposing that each of the catalog headers that currently has DATA
lines instead have a COPY loadable file that contains the same information.
So, for pg_type.h, there would be a pg_type.dat file.  All the DATA lines
would be pulled out of pg_type.h and a corresponding tab delimited row
would be written to pg_type.dat.  Henceforth, if you cloned the git repository,
you'd find no DATA lines in pg_type.h, but would find a pg_type.dat file
in the src/include/catalog directory.  Likewise for the other header files.

There would be some script, SQL or perl or whatever, that would convert
these .dat files into the .bki file.

Now, if we know that pg_type.dat will be processed before pg_proc.dat,
we can replace all the Oids representing datatypes in pg_proc.dat with the
names for those types, given that we already have a name <=> oid
mapping for types.

Likewise, if we know that pg_proc.dat will be processed before pg_operator.dat,
we can specify both functions and datatypes by name rather than by Oid
in that file, making it much easier to read.  By the time pg_operator.dat is
read, pg_type.dat and pg_proc.dat will already have been read and processed,
so there shouldn't be ambiguity. 

By the time pg_amop.dat is processed, the operators, procs, datatypes,
opfamilies and so forth would already be known.  The example I gave
upthread would be easy to parse:

amopfamily   amoplefttype  amoprighttype  amopstrategy  amoppurpose  amopopr  amopmethod  amopsortfamily
integer_ops  int2          int2           1             search       "<"      btree       0
integer_ops  int2          int2           2             search       "<="     btree       0
integer_ops  int2          int2           3             search       "="      btree       0
integer_ops  int2          int2           4             search       ">="     btree       0
integer_ops  int2          int2           5             search       ">"      btree       0

And if I came along and defined a new datatype, int384, I could add rows to
this file much more easily, as:

amopfamily   amoplefttype  amoprighttype  amopstrategy  amoppurpose  amopopr  amopmethod  amopsortfamily
integer_ops  int384        int384         1             search       "<"      btree       0
integer_ops  int384        int384         2             search       "<="     btree       0
integer_ops  int384        int384         3             search       "="      btree       0
integer_ops  int384        int384         4             search       ">="     btree       0
integer_ops  int384        int384         5             search       ">"      btree       0

I don't see how this creates all that much complication, and I clearly see
how it makes files like pg_operator.{h,dat} and pg_amop.{h,dat} easier to read.
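
To make that concrete, here is a rough Perl sketch of the substitution the
conversion script would do while emitting the .bki file.  Everything here -
the file names, the tab-delimited layout, the column positions, even the
output format - is invented for illustration, not a real proposal:

#!/usr/bin/perl
use strict;
use warnings;

# Pass 1: build the name -> OID map from pg_type.dat
# (assume tab-separated columns: oid, typname, ...).
my %typeoid;
open my $types, '<', 'pg_type.dat' or die "pg_type.dat: $!";
while (<$types>) {
    chomp;
    next if /^\s*(?:#|$)/;          # skip comments and blank lines
    my ($oid, $typname) = split /\t/;
    $typeoid{$typname} = $oid;
}
close $types;

# Pass 2: emit lines for pg_amop, replacing the type names
# (assumed to be columns 2 and 3) with the OIDs just collected.
open my $amop, '<', 'pg_amop.dat' or die "pg_amop.dat: $!";
while (<$amop>) {
    chomp;
    next if /^\s*(?:#|$)/;
    my @f = split /\t/;
    for my $col (1, 2) {            # amoplefttype, amoprighttype
        $f[$col] = $typeoid{ $f[$col] }
            // die "unknown type '$f[$col]' at pg_amop.dat line $.";
    }
    print 'insert ( ', join(' ', @f), " )\n";
}
close $amop;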


mark





Re: [HACKERS] Bootstrap DATA is a pita

2015-12-11 Thread Andres Freund
On 2015-12-11 19:26:38 -0500, Tom Lane wrote:
> I believe it's soluble, but it's going to take something more like
> loading up all the data at once and then doing lookups as we write
> out the .bki entries for each catalog.  Fortunately, the volume of
> bootstrap data is small enough that that won't be a problem on any
> machine capable of running modern Postgres ...

I think that's exactly the right approach. Just building a few perl
hashes worked well enough, in my prototype of that.

If additionally a few more plain oid fields are converted into reg*
types, the source data fields are easier to understand and the catalogs
get much nicer to query...




Re: [HACKERS] Bootstrap DATA is a pita

2015-12-11 Thread Tom Lane
Mark Dilger  writes:
> Now, if we know that pg_type.dat will be processed before pg_proc.dat,
> we can replace all the Oids representing datatypes in pg_proc.dat with the
> names for those types, given that we already have a name <=> oid
> mapping for types.

I don't think this is quite as simple as you paint it.  How can you
process pg_type.dat first, when it contains pg_proc references?  Doing
pg_proc first is no better, because it contains pg_type references.

I believe it's soluble, but it's going to take something more like
loading up all the data at once and then doing lookups as we write
out the .bki entries for each catalog.  Fortunately, the volume of
bootstrap data is small enough that that won't be a problem on any
machine capable of running modern Postgres ...
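
Sketched in Perl, with invented file layouts, the shape would be roughly
this: read everything before writing anything, so the circularity never
bites:

use strict;
use warnings;

# Load every catalog's rows into memory first; nothing is written yet.
my %rows;    # catalog name -> arrayref of [ field, field, ... ]
for my $cat (qw(pg_type pg_proc pg_operator pg_amop)) {
    open my $fh, '<', "$cat.dat" or die "$cat.dat: $!";
    while (<$fh>) {
        chomp;
        next if /^\s*(?:#|$)/;
        push @{ $rows{$cat} }, [ split /\t/ ];
    }
    close $fh;
}

# With the full data set in hand, build the lookup tables (assume
# field 0 is the OID and field 1 the name in each file).
my %procoid = map { $_->[1] => $_->[0] } @{ $rows{pg_proc} };

# Only now write out the entries, resolving references as we go;
# pg_type rows can mention procs and vice versa without trouble.
for my $t (@{ $rows{pg_type} }) {
    $t->[2] = $procoid{ $t->[2] }       # e.g. typinput, given by name
        // die "no such proc: $t->[2]";
}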

regards, tom lane




Re: [HACKERS] Bootstrap DATA is a pita

2015-12-11 Thread Tom Lane
Andres Freund  writes:
> On 2015-12-11 19:26:38 -0500, Tom Lane wrote:
>> I believe it's soluble, but it's going to take something more like
>> loading up all the data at once and then doing lookups as we write
>> out the .bki entries for each catalog.  Fortunately, the volume of
>> bootstrap data is small enough that that won't be a problem on any
>> machine capable of running modern Postgres ...

> I think that's exactly the right approach. Just building a few perl
> hashes worked well enough, in my prototype of that.

Right.  I would draw Mark's attention to src/backend/catalog/Catalog.pm
and the things that use that.  Presumably all that would have be
rewritten, but the existing code would be a useful starting point
perhaps.

regards, tom lane




Re: [HACKERS] Bootstrap DATA is a pita

2015-12-11 Thread Alvaro Herrera
Caleb Welton wrote:
> I'm happy working these ideas forward if there is interest.
> 
> Basic design proposal is:
>   - keep a minimal amount of bootstrap to avoid intrusive changes to core
> components
>   - Add capabilities of creating objects with specific OIDs via DDL during
> initdb
>   - Update the caching/resolution mechanism for builtin functions to be
> more dynamic.
>   - Move as much of bootstrap as possible into SQL files and create catalog
> via DDL

I think the point we got stuck at last time was deciding on a good
format for the data coming from the DATA lines.  One of the objections
raised for formats such as JSON is that it's trivial for "git merge" (or
similar tools) to make a mistake because object-end/object-start lines
are all identical.  And as for the SQL-format version, the objection was
that it's hard to modify the lines en-masse when modifying the catalog
definition (new column, etc).  Ideally we would like a format that can
be bulk-edited without too much trouble.

A SQL file would presumably not have the merge issue, but mass-editing
would be a pain.

Crazy idea: we could just have a CSV file which can be loaded into a
table for mass changes using regular DDL commands, then dumped back from
there into the file.  We already know how to do these things, using
\copy etc.  Since CSV uses one line per entry, there would be no merge
problems either (or rather: all merge problems would become conflicts,
which is what we want.)
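
Driven from Perl with DBD::Pg's COPY support, the round trip could look
like this (the scratch table, file name, and the sample UPDATE are purely
illustrative):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=postgres', '', '', { RaiseError => 1 });

# Scratch copy of the catalog's structure to mass-edit in.
$dbh->do('CREATE TEMP TABLE proc_data (LIKE pg_proc)');

# CSV file -> table.
$dbh->do('COPY proc_data FROM STDIN WITH (FORMAT csv)');
open my $in, '<', 'pg_proc.data' or die $!;
$dbh->pg_putcopydata($_) while <$in>;
$dbh->pg_putcopyend;
close $in;

# The mass edit, as ordinary SQL (14 being the OID of the sql language).
$dbh->do('UPDATE proc_data SET procost = 100 WHERE prolang = 14');

# Table -> CSV file again, one line per entry, in a stable order so
# that unchanged entries diff cleanly.
$dbh->do('COPY (SELECT * FROM proc_data ORDER BY proname)
          TO STDOUT WITH (FORMAT csv)');
open my $out, '>', 'pg_proc.data' or die $!;
my $line;
print {$out} $line while $dbh->pg_getcopydata($line) >= 0;
close $out;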

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Bootstrap DATA is a pita

2015-12-11 Thread Caleb Welton
I'm happy working these ideas forward if there is interest.

Basic design proposal is:
  - keep a minimal amount of bootstrap to avoid intrusive changes to core
components
  - Add capabilities of creating objects with specific OIDs via DDL during
initdb
  - Update the caching/resolution mechanism for builtin functions to be
more dynamic.
  - Move as much of bootstrap as possible into SQL files and create catalog
via DDL

Feedback appreciated.

I can provide a sample patch if there is interest; roughly 500 lines of
combined diff for the needed infrastructure to support the above, not
including the modifications to pg_proc.h that would follow.

Thanks,
  Caleb

On Thu, Dec 10, 2015 at 11:47 AM, Caleb Welton wrote:
>
>
> Hello Hackers,
>
>   Reviving an old thread on simplifying the bootstrap process.
>
>   I'm a developer from the GPDB / HAWQ side of the world where we did some
> work a while back to enable catalog definition via SQL files and we have
> found it valuable from a dev perspective.  The mechanism currently in those
> products is a bit convoluted: SQL is processed in perl to create the
> existing DATA statements, which are then processed as they are today in
> Postgres... I wouldn't suggest this route, but having worked with both the
> DATA mechanism and the SQL-based one I've certainly found SQL to be a more
> convenient way of interacting with the catalog.
>
>   I'd propose:
>  - Keep enough of the existing bootstrap mechanism functional to get a
> small tidy core, essentially you need enough of pg_type, pg_proc, pg_class,
> pg_attribute to support the 25 types used by catalog tables and most
> everything else can be moved into SQL processing like how system_views.sql
> is handled today.
>
>   The above was largely proposed back in March and rejected based on
> concerns that
>
>   1. initdb would be slower.
>   2. It would introduce too much special purpose bootstrap cruft into the
> code.
>   3. Editing SQL commands is not comfortable in bulk
>
> On 1.
>
> I have a prototype that handles about 1000 functions (all the functions in
> pg_proc.h that are not used by other catalog tables, e.g. pg_type,
> pg_language, pg_range, pg_aggregate, window functions, pg_ts_parser, etc).
>
> All of initdb can be processed in 1.53s. This compares to 1.37s with the
> current bootstrap approach.  So yes, this is slower, but not 'noticeably
> slower' - I certainly didn't notice the 0.16s until I saw the concern and
> then timed it.
>
> On 2.
>
> So far the amount of cruft has been:
>   - Enabling adding functions with specific OIDs when creating functions.
> 1 line changes in pg_aggregate.c, proclang.c, typecmds.c
> about a dozen lines of code in functioncmds.c
> 3 lines changed in pg_proc.c
>   - Update the fmgr_internal_validator for builtin functions while the
> catalog is mutable
> 3 lines changed in pg_proc.c
>   - Update how the builtin function cache is built
> Some significant work in fmgr.c that honestly still needs cleanup
> before it would be ready to propose as a patch that would be worthy of
> committing.
>   - Update how builtin functions are resolved outside of bootstrap
> Minor updates to dynloader for lookup of symbols within the current
> executable, so far I've only done darwin.c for my prototype, this would
> need to be extended to the other ports.
>   - Initialization of the builtin cache
> 2 line change in postinit.c
>   - Addition of a stage in initdb to process the sql directives similar in
> scope to the processing of system_views.sql.
>
> No changes needed in the parser, planner, etc.  My assessment is that this
> worry is not a major concern in practice with the right implementation.
>
> On 3.
>
> Having worked with both SQL and bki DATA directives I have personally found
> the convenience of SQL outweighs the pain.  In many cases changes, such as
> adding a new column to pg_proc, have minimal impact on the SQL
> representation and what changes are needed are often simple to implement.
> E.g. accounting for COST only needs to be done for the functions that need
> something other than the default value.  This however is somewhat
> subjective.
>
> On the Pros side:
>
>   a. Debugging bootstrap is extremely painful, debugging once initdb has
> gotten to 'postgres --single' is way easier.
>
>   b. It is easier to introduce minor issues with DATA directives than it is
> when using the SQL processing used for all other user objects.
>
>Example: currently in Postgres all builtin functions default to COST 1,
> and all SQL functions default to cost 100. However the following SQL
> functions included in bootstrap inexplicably are initialized with a COST of
> 1:
>age(timestamp with time zone)
>age(timestamp without time zone)
>bit_length(bytea)
>bit_length(text)
>bit_length(bit)
>date_part(text, abstime)
>date_part(text, reltime)
>date_part(text, date)
>... and 26 other examples
>
>   c. SQL files are significantly less of a PITA (subjective opinion, but I
> can say this from a perspective of experience working with both DATA
> directives and SQL driven catalog definition).

Re: [HACKERS] Bootstrap DATA is a pita

2015-12-11 Thread Mark Dilger

> On Dec 11, 2015, at 2:54 PM, Caleb Welton  wrote:
> 
> The current semantic level is pretty low level, somewhat cumbersome, and
> requires filling in values for which, most of the time, the system already
> has a pretty good idea of the right defaults.
> 
> Compare:
> CREATE FUNCTION lo_export(oid, text) RETURNS integer LANGUAGE internal STRICT 
> AS 'lo_export' WITH (OID=765);  
> 
> DATA(insert OID = 765 (  lo_export   PGNSP PGUID 12 1 0 0 0 f f f 
> f t f v u 2 0 23 "26 25" _null_ _null_ _null_ _null_ _null_ lo_export _null_ 
> _null_ _null_ ));

I would like to hear more about this idea.  Are you proposing that we use
something like the above CREATE FUNCTION format to express what is currently
being expressed with DATA statements?  That is an interesting idea, though
I don't know what exactly that would look like.  If you want to forward this
idea, I'd be eager to hear your thoughts.  If not, I'll try to make progress
with my idea of tab delimited files and such (or really, Alvaro's idea of
csv files that I only slightly corrupted).

mark



Re: [HACKERS] Bootstrap DATA is a pita

2015-12-11 Thread Caleb Welton
Makes sense.  

During my own prototyping, what I did was generate the SQL statements by
querying the existing catalog with SQL.  That is way easier than hand-writing
1000+ function definitions and not difficult to modify for future changes;
in fact, it was very easy to adapt my existing SQL to account for some of
the newer features in master.

The biggest challenge was establishing a sort order that both is unique and
ensures that the dependencies needed for SQL functions have been processed
before trying to define them.  This affects about 4 of 1000 functions under
a natural OID ordering.
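
A simplified sketch of such an ordering (the sample entries are invented,
and it ignores SQL functions that depend on other SQL functions):

use strict;
use warnings;

# Invented sample input; imagine one hashref per pg_proc row.
my @functions = (
    { oid => 1305, name => 'overlaps',  lang => 'sql'      },
    { oid => 765,  name => 'lo_export', lang => 'internal' },
    { oid => 1299, name => 'now',       lang => 'internal' },
);

# Internal-language functions carry no SQL-level dependencies, so they
# sort first; LANGUAGE sql functions follow.  OID is the tiebreaker,
# which keeps the emitted order unique and deterministic.
my @ordered = sort {
       ($a->{lang} eq 'sql') <=> ($b->{lang} eq 'sql')
    || $a->{oid} <=> $b->{oid}
} @functions;

print "$_->{oid}\t$_->{name}\n" for @ordered;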

> On Dec 11, 2015, at 11:43 AM, Alvaro Herrera  wrote:
> 
> Caleb Welton wrote:
>> I'm happy working these ideas forward if there is interest.
>> 
>> Basic design proposal is:
>>  - keep a minimal amount of bootstrap to avoid intrusive changes to core
>> components
>>  - Add capabilities of creating objects with specific OIDs via DDL during
>> initdb
>>  - Update the caching/resolution mechanism for builtin functions to be
>> more dynamic.
>>  - Move as much of bootstrap as possible into SQL files and create catalog
>> via DDL
> 
> I think the point we got stuck at last time was deciding on a good
> format for the data coming from the DATA lines.  One of the objections
> raised for formats such as JSON is that it's trivial for "git merge" (or
> similar tools) to make a mistake because object-end/object-start lines
> are all identical.  And as for the SQL-format version, the objection was
> that it's hard to modify the lines en-masse when modifying the catalog
> definition (new column, etc).  Ideally we would like a format that can
> be bulk-edited without too much trouble.
> 
> A SQL file would presumably not have the merge issue, but mass-editing
> would be a pain.
> 
> Crazy idea: we could just have a CSV file which can be loaded into a
> table for mass changes using regular DDL commands, then dumped back from
> there into the file.  We already know how to do these things, using
> \copy etc.  Since CSV uses one line per entry, there would be no merge
> problems either (or rather: all merge problems would become conflicts,
> which is what we want.)
> 
> -- 
> Álvaro Herrera                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Bootstrap DATA is a pita

2015-12-10 Thread Caleb Welton
Hello Hackers,

  Reviving an old thread on simplifying the bootstrap process.

  I'm a developer from the GPDB / HAWQ side of the world where we did some
work a while back to enable catalog definition via SQL files and we have
found it valuable from a dev perspective.  The mechanism currently in those
products is a bit.. convoluted where SQL is processed in perl to create the
existing DATA statements, which are then processed as they are today in
Postgres... I wouldn't suggest this route, but having worked with both the
DATA mechanism and the SQL based one I've certainly found SQL to be a more
convenient way of interacting with the catalog.

  I'd propose:
 - Keep enough of the existing bootstrap mechanism functional to get a
small tidy core, essentially you need enough of pg_type, pg_proc, pg_class,
pg_attribute to support the 25 types used by catalog tables and most
everything else can be moved into SQL processing like how system_views.sql
is handled today.

  The above was largely proposed back in March and rejected based on
concerns that

  1. initdb would be slower.
  2. It would introduce too much special purpose bootstrap cruft into the
code.
  3. Editing SQL commands is not comfortable in bulk

On 1.

I have a prototype that handles about 1000 functions (all the functions in
pg_proc.h that are not used by other catalog tables, e.g. pg_type,
pg_language, pg_range, pg_aggregate, window functions, pg_ts_parser, etc).

All of initdb can be processed in 1.53s. This compares to 1.37s with the
current bootstrap approach.  So yes, this is slower, but not 'noticeably
slower' - I certainly didn't notice the 0.16s until I saw the concern and
then timed it.

On 2.

So far the amount of cruft has been:
  - Enabling adding functions with specific OIDs when creating functions.
1 line changes in pg_aggregate.c, proclang.c, typecmds.c
about a dozen lines of code in functioncmds.c
3 lines changed in pg_proc.c
  - Update the fmgr_internal_validator for builtin functions while the
catalog is mutable
3 lines changed in pg_proc.c
  - Update how the builtin function cache is built
Some significant work in fmgr.c that honestly still needs cleanup
before it would be ready to propose as a patch that would be worthy of
committing.
  - Update how builtin functions are resolved outside of bootstrap
Minor updates to dynloader for lookup of symbols within the current
executable, so far I've only done darwin.c for my prototype, this would
need to be extended to the other ports.
  - Initialization of the builtin cache
2 line change in postinit.c
  - Addition of a stage in initdb to process the sql directives similar in
scope to the processing of system_views.sql.

No changes needed in the parser, planner, etc.  My assessment is that this
worry is not a major concern in practice with the right implementation.

On 3.

Having worked with both SQL and bki DATA directives I have personally found
the convenience of SQL outweighs the pain.  In many cases changes, such as
adding a new column to pg_proc, have minimal impact on the SQL
representation and what changes are needed are often simple to implement.
E.g. accounting for COST only needs to be done for the functions that need
something other than the default value.  This however is somewhat
subjective.

On the Pros side:

  a. Debugging bootstrap is extremely painful, debugging once initdb has
gotten to 'postgres --single' is way easier.

  b. It is easier to introduce minor issues with DATA directives than it is
when using the SQL processing used for all other user objects.

   Example: currently in Postgres all builtin functions default to COST 1,
and all SQL functions default to cost 100. However the following SQL
functions included in bootstrap inexplicably are initialized with a COST of
1:
   age(timestamp with time zone)
   age(timestamp without time zone)
   bit_length(bytea)
   bit_length(text)
   bit_length(bit)
   date_part(text, abstime)
   date_part(text, reltime)
   date_part(text, date)
   ... and 26 other examples

  c. SQL files are significantly less of a PITA (subjective opinion, but I
can say this from a perspective of experience working with both DATA
directives and SQL driven catalog definition).

If people have interest I can share my patch so far if that helps address
concerns, but if there is not interest then I'll probably leave my
prototype where it is rather than investing more effort in the proof of
concept.

Thanks,
  Caleb


On Sat, Mar 7, 2015 at 5:20 PM,  wrote:

> Date: Sat, 7 Mar 2015 23:46:54 +0100
> From: Andres Freund 
> To: Jim Nasby 
> Cc: Stephen Frost , Robert Haas  >,
> Tom Lane , Peter Eisentraut ,
> Josh Berkus ,
> "pgsql-hackers@postgresql.org" 
> Subject: Re: Bootstrap 

Re: [HACKERS] Bootstrap DATA is a pita

2015-03-09 Thread Robert Haas
On Sun, Mar 8, 2015 at 12:35 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2015-03-04 10:25:58 -0500, Robert Haas wrote:
 Another advantage of this is that it would probably make git less
 likely to fumble a rebase.  If there are lots of places in the file
 where we have the same 10 lines in a row with occasional variations,
 rebasing a patch could easily pick the wrong place to reapply the
 hunk.  I would personally consider a substantial increase in the rate
 of such occurrences as being a cure far, far worse than the disease.
 If you keep the entry for each function on just a couple of lines the
 chances of this happening are greatly reduced, because you're much less
 likely to get a false match to surrounding context.

 I'm not particularly worried about this. Especially with attribute
 defaults it seems unlikely that you often have the same three
 surrounding lines in both directions in a similar region of the file.

That's woefully optimistic, and you don't need to have 3 lines.  1 or
2 will do fine.

 And even if it turns out to actually be bothersome, you can help
 yourself by passing -U 5/setting diff.context = 5 or something like
 that.

I don't believe that for a minute.  When you have your own private
branch and you do 'git rebase master', how's that going to help?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-09 Thread Andres Freund
On 2015-03-07 18:09:36 -0600, Jim Nasby wrote:
 How often does a normal user actually initdb? I don't think it's that
 incredibly common. Added time to our development cycle certainly is a
 concern though.

There are many shops that run initdb as part of their test/CI systems.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-08 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 * Andrew Dunstan (and...@dunslane.net) wrote:
 On 03/07/2015 05:46 PM, Andres Freund wrote:
 On 2015-03-07 16:43:15 -0600, Jim Nasby wrote:
 Semi-related... if we put some special handling in some places for 
 bootstrap
 mode, couldn't most catalog objects be created using SQL, once we got
 pg_class, pg_attributes and pg_type created?

 Several people have now made that suggestion, but I *seriously* doubt
 that we actually want to go there. The overhead of executing SQL
 commands in comparison to the bki stuff is really rather
 noticeable. Doing the majority of the large number of insertions via SQL
 will make initdb noticeably slower. And it's already annoyingly
 slow. Besides make install it's probably the thing I wait most for
 during development.

 My reaction exactly. We should not make users pay a price for
 developers' convenience.

Another reason not to do this is that it would require a significant (in
my judgment) amount of crapification of a lot of code with bootstrap-mode
special cases.  Neither the parser, the planner, nor the executor could
function in bootstrap mode without a lot of lobotomization.  Far better
to confine all that ugliness to bootstrap.c.

 Just to clarify, since Jim was responding to my comment, my thought was
 *not* to use SQL commands inside initdb, but rather to use PG to create
 the source files that we have today in our tree, which wouldn't slow
 down initdb at all.

That, on the other hand, might be a sane suggestion.  I'm not sure
though.  It feels more like use the hammer you have at hand than
necessarily being a good fit.  In particular, keeping the raw data in
some tables doesn't seem like an environment that would naturally
distinguish between hard-coded and defaultable values.  For instance,
how would you distinguish hard-coded OIDs from ones that could be
assigned at initdb's whim?

regards, tom lane




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-08 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2015-03-04 10:25:58 -0500, Robert Haas wrote:
 Another advantage of this is that it would probably make git less
 likely to fumble a rebase.  If there are lots of places in the file
 where we have the same 10 lines in a row with occasional variations,
 rebasing a patch could easily pick the wrong place to reapply the
 hunk.  I would personally consider a substantial increase in the rate
 of such occurrences as being a cure far, far worse than the disease.
 If you keep the entry for each function on just a couple of lines the
 chances of this happening are greatly reduced, because you're much less
 likely to get a false match to surrounding context.

 I'm not particularly worried about this. Especially with attribute
 defaults it seems unlikely that you often have the same three
 surrounding lines in both directions in a similar region of the file.

Really?  A lot depends on the details of how we choose to lay out these
files, but you could easily blow all your safety margin on lines
containing just braces, for instance.

I'll reserve judgment on this till I see the proposed new catalog data
files, but I absolutely reject any contention that it's not something
to worry about.

 And even if it turns out to actually be bothersome, you can help
 yourself by passing -U 5/setting diff.context = 5 or something like
 that.

Um.  Good luck with getting every patch submitter to do that.

regards, tom lane




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-08 Thread Alvaro Herrera
Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:

  And even if it turns out to actually be bothersome, you can help
  yourself by passing -U 5/setting diff.context = 5 or something like
  that.
 
 Um.  Good luck with getting every patch submitter to do that.

Can we do it centrally somehow?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-08 Thread Andrew Dunstan


On 03/08/2015 10:11 PM, Alvaro Herrera wrote:

Tom Lane wrote:

Andres Freund and...@2ndquadrant.com writes:

And even if it turns out to actually be bothersome, you can help
yourself by passing -U 5/setting diff.context = 5 or something like
that.

Um.  Good luck with getting every patch submitter to do that.

Can we do it centrally somehow?




I don't believe there is provision for setting diff.context on a
per-file basis.


cheers

andrew




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-08 Thread Andres Freund
On 2015-03-04 10:25:58 -0500, Robert Haas wrote:
 Another advantage of this is that it would probably make git less
 likely to fumble a rebase.  If there are lots of places in the file
 where we have the same 10 lines in a row with occasional variations,
 rebasing a patch could easily pick the wrong place to reapply the
 hunk.  I would personally consider a substantial increase in the rate
 of such occurrences as being a cure far, far worse than the disease.
 If you keep the entry for each function on just a couple of lines the
 chances of this happening are greatly reduced, because you're much less
 likely to get a false match to surrounding context.

I'm not particularly worried about this. Especially with attribute
defaults it seems unlikely that you often have the same three
surrounding lines in both directions in a similar region of the file.

And even if it turns out to actually be bothersome, you can help
yourself by passing -U 5/setting diff.context = 5 or something like
that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-07 Thread Andres Freund
On 2015-03-07 16:43:15 -0600, Jim Nasby wrote:
 Semi-related... if we put some special handling in some places for bootstrap
 mode, couldn't most catalog objects be created using SQL, once we got
 pg_class, pg_attributes and pg_type created? That would theoretically allow
 us to drive much more of initdb with plain SQL (possibly created via
 pg_dump).

Several people have now made that suggestion, but I *seriously* doubt
that we actually want to go there. The overhead of executing SQL
commands in comparison to the bki stuff is really rather
noticeable. Doing the majority of the large number of insertions via SQL
will make initdb noticeably slower. And it's already annoyingly
slow. Besides make install it's probably the thing I wait most for
during development.

That's besides the fact that SQL commands aren't actually that
comfortably editable in bulk.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-07 Thread Jim Nasby

On 3/4/15 9:07 AM, Stephen Frost wrote:

* Robert Haas (robertmh...@gmail.com) wrote:

On Wed, Mar 4, 2015 at 9:42 AM, Tom Lane t...@sss.pgh.pa.us wrote:

and make it harder to compare entries by grepping out some common
substring.



Could you give an example of the sort of thing you wish to do?


On that angle, I'm dubious that a format that allows omission of fields is
going to be easy for editing scripts to modify, no matter what the layout
convention is.  I've found it relatively easy to write sed or even Emacs
macros to add new column values to old-school pg_proc.h ... but in this
brave new world, I'm going to be really hoping that the column default
works for 99.9% of pg_proc entries when we add a new pg_proc column,
because slipping a value into a desired position is gonna be hard for
a script when you don't know whether the adjacent existing fields are
present or not.


I wonder if we should have a tool in our repository to help people
edit the file.  So instead of going in there yourself and changing
things by hand, or writing your own script, you can do:

updatepgproc.pl --oid 5678 provolatile=v

or

updatepgproc.pl --name='.*xact.*' prowhatever=someval

Regardless of what format we end up with, that seems like it would
make things easier.


Alright, I'll bite on this- we have this really neat tool for editing
data in bulk, or individual values, or pulling out data to look at based
on particular values or even functions...  It's called PostgreSQL.

What if we had an easy way to export an existing table into whatever
format we decide to use for initdb to use?  For that matter, what if
that file was simple to import into PG?

What about having a way to load all the catalog tables from their git
repo files into a pg_dev schema?  Maybe even include a make target or
initdb option which does that?  (the point here being to provide a way
to modify the tables and compare the results to the existing tables
without breaking the instance one is using for this)

I have to admit that I've never tried to do that with the existing
format, but seems like an interesting idea to consider.  I further
wonder if it'd be possible to generate the table structures too..


Semi-related... if we put some special handling in some places for 
bootstrap mode, couldn't most catalog objects be created using SQL, once 
we got pg_class, pg_attributes and pg_type created? That would 
theoretically allow us to drive much more of initdb with plain SQL 
(possibly created via pg_dump).

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-07 Thread Stephen Frost
* Andrew Dunstan (and...@dunslane.net) wrote:
 On 03/07/2015 05:46 PM, Andres Freund wrote:
 On 2015-03-07 16:43:15 -0600, Jim Nasby wrote:
 Semi-related... if we put some special handling in some places for bootstrap
 mode, couldn't most catalog objects be created using SQL, once we got
 pg_class, pg_attributes and pg_type created? That would theoretically allow
 us to drive much more of initdb with plain SQL (possibly created via
 pg_dump).
 Several people have now made that suggestion, but I *seriously* doubt
 that we actually want to go there. The overhead of executing SQL
 commands in comparison to the bki stuff is really rather
 noticeable. Doing the majority of the large number of insertions via SQL
 will make initdb noticeably slower. And it's already annoyingly
 slow. Besides make install it's probably the thing I wait most for
 during development.
 
 My reaction exactly. We should not make users pay a price for
 developers' convenience.

Just to clarify, since Jim was responding to my comment, my thought was
*not* to use SQL commands inside initdb, but rather to use PG to create
the source files that we have today in our tree, which wouldn't slow
down initdb at all.

 That's besides the fact that SQL commands aren't actually that
 comfortably editable in bulk.
 
 Indeed.

No, they aren't, but having the data in a table in PG, with a way to
easily export to the format needed by BKI, would make bulk updates much
easier..

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Bootstrap DATA is a pita

2015-03-07 Thread Jim Nasby

On 3/7/15 6:02 PM, Stephen Frost wrote:

* Andrew Dunstan (and...@dunslane.net) wrote:

On 03/07/2015 05:46 PM, Andres Freund wrote:

On 2015-03-07 16:43:15 -0600, Jim Nasby wrote:

Semi-related... if we put some special handling in some places for bootstrap
mode, couldn't most catalog objects be created using SQL, once we got
pg_class, pg_attributes and pg_type created? That would theoretically allow
us to drive much more of initdb with plain SQL (possibly created via
pg_dump).

Several people have now made that suggestion, but I *seriously* doubt
that we actually want to go there. The overhead of executing SQL
commands in comparison to the bki stuff is really rather
noticeable. Doing the majority of the large number of insertions via SQL
will make initdb noticeably slower. And it's already annoyingly
slow. Besides make install it's probably the thing I wait most for
during development.


My reaction exactly. We should not make users pay a price for
developers' convenience.


How often does a normal user actually initdb? I don't think it's that 
incredibly common. Added time to our development cycle certainly is a 
concern though.



Just to clarify, since Jim was responding to my comment, my thought was
*not* to use SQL commands inside initdb, but rather to use PG to create
the source files that we have today in our tree, which wouldn't slow
down initdb at all.


Yeah, I was thinking SQL would make it even easier, but perhaps not.
Since the other options here seem to have hit a dead end though, it
seems your "load it into tables" idea is what we've got left...



That's besides the fact that SQL commands aren't actually that
comfortably editable in bulk.


Indeed.


No, they aren't, but having the data in a table in PG, with a way to
easily export to the format needed by BKI, would make bulk updates much
easier..


My thought was that pg_dump would be useful here, so instead of hand 
editing you'd just make changes in a live database and then dump it.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-07 Thread Andrew Dunstan


On 03/07/2015 05:46 PM, Andres Freund wrote:

On 2015-03-07 16:43:15 -0600, Jim Nasby wrote:

Semi-related... if we put some special handling in some places for bootstrap
mode, couldn't most catalog objects be created using SQL, once we got
pg_class, pg_attributes and pg_type created? That would theoretically allow
us to drive much more of initdb with plain SQL (possibly created via
pg_dump).

Several people have now made that suggestion, but I *seriously* doubt
that we actually want to go there. The overhead of executing SQL
commands in comparison to the bki stuff is really rather
noticeable. Doing the majority of the large number of insertions via SQL
will make initdb noticeably slower. And it's already annoyingly
slow. Besides make install it's probably the thing I wait most for
during development.



My reaction exactly. We should not make users pay a price for 
developers' convenience.




That's besides the fact that SQL commands aren't actually that
comfortably editable in bulk.


Indeed.

cheers

andrew




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-06 Thread Alvaro Herrera
Robert Haas wrote:
 On Wed, Mar 4, 2015 at 2:27 PM, Alvaro Herrera alvhe...@2ndquadrant.com 
 wrote:

  BTW one solution to the merge problem is to have unique separators for
  each entry.  For instance, instead of

 Speaking from entirely too much experience, that's not nearly enough.
 git only needs 3 lines of context to apply a hunk with no qualms at
 all, and it'll shade that to just 1 or 2 with little fanfare.  If your
 pg_proc entries are each 20 lines long, this sort of thing will
 provide little protection.

Yeah, you're right.  This is going to be a problem, and we need some
solution for it.  I'm out of ideas, other than of course getting each
entry to be at most two lines long which nobody seems to like (for good
reasons.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-04 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Another advantage of this is that it would probably make git less
 likely to fumble a rebase.  If there are lots of places in the file
 where we have the same 10 lines in a row with occasional variations,
 rebasing a patch could easily pick the wrong place to reapply the
 hunk.

That is a really, really good point.

I had been thinking it was a disadvantage of Andrew's proposal that
line breaks would tend to fall in inconsistent places from one entry
to another ... but from this perspective, maybe that's not such a
bad thing.

regards, tom lane




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-04 Thread Andres Freund
On 2015-03-04 09:55:01 -0500, Robert Haas wrote:
 On Wed, Mar 4, 2015 at 9:42 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wonder if we should have a tool in our repository to help people
 edit the file.  So instead of going in there yourself and changing
 things by hand, or writing your own script, you can do:
 
 updatepgproc.pl --oid 5678 provolatile=v
 
 or
 
 updatepgproc.pl --name='.*xact.*' prowhatever=someval
 
 Regardless of what format we end up with, that seems like it would
 make things easier.

The stuff I've started to work on basically allows loading what
Catalog.pm provides (in an extended format), editing it in memory, and
then serializing it again. So such a thing could relatively easily be
added if somebody wants to do so.  I sure hope though that the need for
it will become drastically lower with the new format.
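
A toy sketch of that cycle (the entry format is invented); the important
part is serializing with a fixed key order, so entries that weren't
touched come back out byte-for-byte identical:

use strict;
use warnings;

my @keyorder = qw(oid proname provolatile);   # canonical field order
my @entries  = (
    { oid => 2232, proname => 'array_prepend', provolatile => 'i' },
    { oid => 2233, proname => 'array_append',  provolatile => 'i' },
);

# The in-memory edit, whatever the task of the day is.
$_->{provolatile} = 'v' for grep { $_->{proname} eq 'array_append' } @entries;

# Serialize by walking @keyorder, never the hash's own key order.
for my $e (@entries) {
    print "{\n";
    printf "  %s = %s,\n", $_, $e->{$_} for grep { exists $e->{$_} } @keyorder;
    print "}\n";
}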

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-04 Thread Alvaro Herrera
Andrew Dunstan wrote:
 
 On 03/04/2015 09:51 AM, Robert Haas wrote:
 On Wed, Mar 4, 2015 at 9:06 AM, Peter Eisentraut pete...@gmx.net wrote:
 and make it harder to compare entries by grepping out some common
 substring.
 Could you give an example of the sort of thing you wish to do?
 e.g. grep for a function name and check that all the matches have the
 same volatility.
 
 I think grep will be the wrong tool for this format, but if we're settling
 on a perl format, a few perl one-liners should be able to work pretty well.
 It might be worth shipping a small perl module that provided some functions,
 or a script doing common tasks (or both).

I was going to say the same thing.  We need to make sure that the output
format of those oneliners is consistent, though -- it wouldn't be nice
if adding one column with nondefault value to a dozen of entries changes
the formatting of other entries.  For example, perhaps declare that the
order of entries is alphabetical or it matches something declared at the
start of the file.

From that POV, I don't like the idea of having multiple columns for a
single entry in a single line; adding more columns means that eventually
we're going to split lines that have become too long in a different
place, which would reformat the whole file; not very nice.  But maybe
this doesn't matter if we decree that changing the column split is a
manual chore rather than automatic, because then it can be done in a
separate mechanical commit after the extra column is added.

BTW one solution to the merge problem is to have unique separators for
each entry.  For instance, instead of

}   -- this is the end of the previous entry
,
{ 
oid = 2233,
proname = array_append,

we could have
} # array_prepend 2232
,
{ # array_append 2233
oid = 2233,
proname = array_append,

where the funcname-oid comment is there to avoid busted merges.  The
automatic editing tools make sure that those markers are always present.
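
For instance, a sketch of a serializer that emits the markers mechanically
(entry format as proposed above, sample entries invented):

use strict;
use warnings;

my @entries = (
    { oid => 2232, proname => 'array_prepend' },
    { oid => 2233, proname => 'array_append'  },
);

# The serializer, not the human, writes the identity marker, so it can
# never drift out of sync with the entry it terminates.
print join(",\n", map {
    "{\n  oid = $_->{oid},\n  proname = $_->{proname},\n} # $_->{proname} $_->{oid}"
} @entries), "\n";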

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-04 Thread Robert Haas
On Wed, Mar 4, 2015 at 2:27 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Andrew Dunstan wrote:
 On 03/04/2015 09:51 AM, Robert Haas wrote:
 On Wed, Mar 4, 2015 at 9:06 AM, Peter Eisentraut pete...@gmx.net wrote:
 and make it harder to compare entries by grepping out some common
 substring.
 Could you give an example of the sort of thing you wish to do?
 e.g. grep for a function name and check that all the matches have the
 same volatility.

 I think grep will be the wrong tool for this format, but if we're settling
 on a perl format, a few perl one-liners should be able to work pretty well.
 It might be worth shipping a small perl module that provided some functions,
 or a script doing common tasks (or both).

 I was going to say the same thing.  We need to make sure that the output
 format of those oneliners is consistent, though -- it wouldn't be nice
 if adding one column with nondefault value to a dozen of entries changes
 the formatting of other entries.  For example, perhaps declare that the
 order of entries is alphabetical or it matches something declared at the
 start of the file.

 From that POV, I don't like the idea of having multiple columns for a
 single entry in a single line; adding more columns means that eventually
 we're going to split lines that have become too long in a different
 place, which would reformat the whole file; not very nice.  But maybe
 this doesn't matter if we decree that changing the column split is a
 manual chore rather than automatic, because then it can be done in a
 separate mechanical commit after the extra column is added.

 BTW one solution to the merge problem is to have unique separators for
 each entry.  For instance, instead of

 }   -- this is the end of the previous entry
 ,
 {
 oid = 2233,
 proname = array_append,

 we could have
 } # array_prepend 2232
 ,
 { # array_append 2233
 oid = 2233,
 proname = array_append,

 where the funcname-oid comment is there to avoid busted merges.  The
 automatic editing tools make sure that those markers are always present.

Speaking from entirely too much experience, that's not nearly enough.
git only needs 3 lines of context to apply a hunk with no qualms at
all, and it'll shade that to just 1 or 2 with little fanfare.  If your
pg_proc entries are each 20 lines long, this sort of thing will
provide little protection.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-04 Thread Robert Haas
On Wed, Mar 4, 2015 at 10:04 AM, Andrew Dunstan and...@dunslane.net wrote:
 Is it necessarily an all or nothing deal?

 Taking a previous example, we could have something like:

 {
  oid = 2249,  oiddefine = 'CSTRINGOID',  typname = 'cstring',
  typlen = -2, typbyval = 1,
  ...
 }

 which would allow us to fit within a reasonable edit window (for my normal
 window and font that's around 180 characters) and still reduce the number of
 lines.

 I'm not wedded to it, but it's a thought.

Another advantage of this is that it would probably make git less
likely to fumble a rebase.  If there are lots of places in the file
where we have the same 10 lines in a row with occasional variations,
rebasing a patch could easily pick the wrong place to reapply the
hunk.  I would personally consider a substantial increase in the rate
of such occurrences as being a cure far, far worse than the disease.
If you keep the entry for each function on just a couple of lines the
chances of this happening are greatly reduced, because you're much less
likely to get a false match to surrounding context.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-04 Thread Peter Eisentraut
On 3/4/15 9:51 AM, Robert Haas wrote:
 On Wed, Mar 4, 2015 at 9:06 AM, Peter Eisentraut pete...@gmx.net wrote:
 and make it harder to compare entries by grepping out some common
 substring.

 Could you give an example of the sort of thing you wish to do?
 
 e.g. grep for a function name and check that all the matches have the
 same volatility.

You could still do that with grep -A or something like that.  I think
that it would be easier than now.





Re: [HACKERS] Bootstrap DATA is a pita

2015-03-04 Thread Andres Freund
On 2015-03-03 21:49:21 -0500, Robert Haas wrote:
 On Sat, Feb 21, 2015 at 11:34 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Andres Freund and...@2ndquadrant.com writes:
  On 2015-02-20 22:19:54 -0500, Peter Eisentraut wrote:
  On 2/20/15 8:46 PM, Josh Berkus wrote:
  Or what about just doing CSV?
 
  I don't think that would actually address the problems.  It would just
  be the same format as now with different delimiters.
 
  Yea, we need hierarchies and named keys.
 
  Yeah.  One thought though is that I don't think we need the data layer
  in your proposal; that is, I'd flatten the representation to something
  more like
 
   {
   oid = 2249,
   oiddefine = 'CSTRINGOID',
   typname = 'cstring',
   typlen = -2,
   typbyval = 1,
   ...
   }
 
 Even this promises to vastly increase the number of lines in the file,
 and make it harder to compare entries by grepping out some common
 substring.  I agree that the current format is a pain in the tail, but
 pg_proc.h is 5k lines already.  I don't want it to be 100k lines
 instead.

Do you have a better suggestion? Sure it'll be a long file, but it still
seems vastly superior to what we have now.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-04 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 On 3/3/15 9:49 PM, Robert Haas wrote:
 Even this promises to vastly increase the number of lines in the file,

 I think lines are cheap.  Columns are much harder to deal with.

Yeah.  pg_proc.h is already impossible to work with in a standard
80-column window.  I don't want to find that the lines mostly wrap even
when I expand my editor window to full screen width, but that is certainly
what will happen if we adopt column labelling *and* insist that entries
remain all on one line.  (As a data point, the maximum usable Emacs window
width on my Mac laptop seems to be about 230 characters.)

It's possible that gaining the ability to depend on per-column defaults
would reduce the typical number of fields so much that pg_proc.h entries
would still fit on a line of 100-some characters ... but I'd want to see
proof before assuming that.  And pg_proc isn't even our widest catalog.
Some of the ones that are wider, like pg_am, don't seem like there would
be any scope whatsoever for saving space with per-column defaults.

So while I can see the attraction of trying to fit things on one line,
I doubt it's gonna work very well.  I'd rather go over to a
one-value-per-line format and live with lots of lines.

 and make it harder to compare entries by grepping out some common
 substring.

 Could you give an example of the sort of thing you wish to do?

On that angle, I'm dubious that a format that allows omission of fields is
going to be easy for editing scripts to modify, no matter what the layout
convention is.  I've found it relatively easy to write sed or even Emacs
macros to add new column values to old-school pg_proc.h ... but in this
brave new world, I'm going to be really hoping that the column default
works for 99.9% of pg_proc entries when we add a new pg_proc column,
because slipping a value into a desired position is gonna be hard for
a script when you don't know whether the adjacent existing fields are
present or not.

regards, tom lane




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-04 Thread Robert Haas
On Wed, Mar 4, 2015 at 9:06 AM, Peter Eisentraut pete...@gmx.net wrote:
 and make it harder to compare entries by grepping out some common
 substring.

 Could you give an example of the sort of thing you wish to do?

e.g. grep for a function name and check that all the matches have the
same volatility.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-04 Thread Robert Haas
On Wed, Mar 4, 2015 at 9:42 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 and make it harder to compare entries by grepping out some common
 substring.

 Could you give an example of the sort of thing you wish to do?

 On that angle, I'm dubious that a format that allows omission of fields is
 going to be easy for editing scripts to modify, no matter what the layout
 convention is.  I've found it relatively easy to write sed or even Emacs
 macros to add new column values to old-school pg_proc.h ... but in this
 brave new world, I'm going to be really hoping that the column default
 works for 99.9% of pg_proc entries when we add a new pg_proc column,
 because slipping a value into a desired position is gonna be hard for
 a script when you don't know whether the adjacent existing fields are
 present or not.

I wonder if we should have a tool in our repository to help people
edit the file.  So instead of going in there yourself and changing
things by hand, or writing your own script, you can do:

updatepgproc.pl --oid 5678 provolatile=v

or

updatepgproc.pl --name='.*xact.*' prowhatever=someval

Regardless of what format we end up with, that seems like it would
make things easier.
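
A skeleton of what such a tool might look like; everything here, from the
on-disk format down to the helper subs, is hypothetical:

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;

GetOptions('oid=i' => \my $oid, 'name=s' => \my $name_re)
    or die "usage: $0 [--oid N] [--name REGEX] field=value ...\n";
my %set = map { split /=/, $_, 2 } @ARGV;    # e.g. provolatile=v

my @entries = load_entries('pg_proc.dat');
for my $e (@entries) {
    next if defined $oid     && $e->{oid} != $oid;
    next if defined $name_re && $e->{proname} !~ /$name_re/;
    %$e = (%$e, %set);                       # apply the assignments
}
save_entries('pg_proc.dat', \@entries);

# Stand-ins for Catalog.pm-style helpers: one entry per line, as
# whitespace-separated key=value pairs (again, a made-up format).
sub load_entries {
    my ($file) = @_;
    open my $fh, '<', $file or die "$file: $!";
    return map { chomp; +{ map { split /=/, $_, 2 } split ' ' } }
           grep { /\S/ } <$fh>;
}

sub save_entries {
    my ($file, $entries) = @_;
    open my $fh, '>', $file or die "$file: $!";
    for my $e (@$entries) {
        print {$fh} join(' ', map { "$_=$e->{$_}" } sort keys %$e), "\n";
    }
    close $fh;
}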

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-04 Thread Robert Haas
 Even this promises to vastly increase the number of lines in the file,
 and make it harder to compare entries by grepping out some common
 substring.  I agree that the current format is a pain in the tail, but
 pg_proc.h is 5k lines already.  I don't want it to be 100k lines
 instead.

 Do you have a better suggestion? Sure it'll be a long file, but it still
 seems vastly superior to what we have now.

Not really.  What had occurred to me is to try to improve the format
of the DATA lines (e.g. by allowing names to be used instead of OIDs)
but that wouldn't allow defaulted fields to be omitted, which is
certainly a big win.  I wonder whether some home-grown single-line
format might be better than using a pre-existing format, but I'm not
too sure it would.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-04 Thread Andres Freund
On 2015-03-04 08:47:44 -0500, Robert Haas wrote:
  Even this promises to vastly increase the number of lines in the file,
  and make it harder to compare entries by grepping out some common
  substring.  I agree that the current format is a pain in the tail, but
  pg_proc.h is 5k lines already.  I don't want it to be 100k lines
  instead.
 
  Do you have a better suggestion? Sure it'll be a long file, but it still
  seems vastly superior to what we have now.
 
 Not really.  What had occurred to me is to try to improve the format
 of the DATA lines (e.g. by allowing names to be used instead of OIDs)

That's a separate patch so far, so if we decide we only want that, we
can do it.

 but that wouldn't allow defaulted fields to be omitted, which is
 certainly a big win.  I wonder whether some home-grown single-line
 format might be better than using a pre-existing format, but I'm not
 too sure it would.

I can't see readability of anything being good unless the column names
are there - we just have too many columns in some of the tables. I think
having more lines is an acceptable price to pay. We can easily start to
split the files at some point if we want, that'd just be a couple lines
of code.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-03-04 Thread Peter Eisentraut
On 3/3/15 9:49 PM, Robert Haas wrote:
 Yeah.  One thought though is that I don't think we need the data layer
 in your proposal; that is, I'd flatten the representation to something
 more like

  {
  oid => 2249,
  oiddefine => 'CSTRINGOID',
  typname => 'cstring',
  typlen => -2,
  typbyval => 1,
  ...
  }
 
 Even this promises to vastly increase the number of lines in the file,

I think lines are cheap.  Columns are much harder to deal with.

 and make it harder to compare entries by grepping out some common
 substring.

Could you give an example of the sort of thing you wish to do?



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-03-04 Thread Tom Lane
Robert Haas <robertmh...@gmail.com> writes:
 On Wed, Mar 4, 2015 at 9:06 AM, Peter Eisentraut <pete...@gmx.net> wrote:
 Could you give an example of the sort of thing you wish to do?

 e.g. grep for a function name and check that all the matches have the
 same volatility.

Well, grep is not going to work too well anymore, but extracting a
specific field from an entry is going to be beyond the competence of
simple grep/sed tools anyway if we allow column default substitutions.

I think a fairer question is can you do that in a one-liner Perl script,
which seems like it might be achievable given an appropriate choice of
data markup language.
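
For what it's worth, if the markup ended up being plain Perl data,
something close to a one-liner seems plausible (the file name and the
assumption of one hash per row with proname/provolatile keys are
hypothetical):

# Print any function name that appears with more than one volatility.
perl -e '
    my $rows = do "./pg_proc.data" or die $@;
    my %vol;
    $vol{ $_->{proname} }{ $_->{provolatile} } = 1 for @$rows;
    for my $name (sort keys %vol) {
        print "$name\n" if keys %{ $vol{$name} } > 1;
    }
'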

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-03-04 Thread Andrew Dunstan


On 03/04/2015 09:42 AM, Tom Lane wrote:

Peter Eisentraut <pete...@gmx.net> writes:

On 3/3/15 9:49 PM, Robert Haas wrote:

Even this promises to vastly increase the number of lines in the file,

I think lines are cheap.  Columns are much harder to deal with.

Yeah.  pg_proc.h is already impossible to work with in a standard
80-column window.  I don't want to find that the lines mostly wrap even
when I expand my editor window to full screen width, but that is certainly
what will happen if we adopt column labelling *and* insist that entries
remain all on one line.  (As a data point, the maximum usable Emacs window
width on my Mac laptop seems to be about 230 characters.)

It's possible that gaining the ability to depend on per-column defaults
would reduce the typical number of fields so much that pg_proc.h entries
would still fit on a line of 100-some characters ... but I'd want to see
proof before assuming that.  And pg_proc isn't even our widest catalog.
Some of the ones that are wider, like pg_am, don't seem like there would
be any scope whatsoever for saving space with per-column defaults.

So while I can see the attraction of trying to fit things on one line,
I doubt it's gonna work very well.  I'd rather go over to a
one-value-per-line format and live with lots of lines.




Is it necessarily an all-or-nothing deal?

Taking a previous example, we could have something like:

{
 oid => 2249,  oiddefine => 'CSTRINGOID',  typname => 'cstring',
 typlen => -2, typbyval => 1,
 ...
}

which would allow us to fit within a reasonable edit window (for my 
normal window and font that's around 180 characters) and still reduce 
the number of lines.


I'm not wedded to it, but it's a thought.

cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-03-04 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 On Wed, Mar 4, 2015 at 9:42 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
  and make it harder to compare entries by grepping out some common
  substring.
 
  Could you give an example of the sort of thing you wish to do?
 
  On that angle, I'm dubious that a format that allows omission of fields is
  going to be easy for editing scripts to modify, no matter what the layout
  convention is.  I've found it relatively easy to write sed or even Emacs
  macros to add new column values to old-school pg_proc.h ... but in this
  brave new world, I'm going to be really hoping that the column default
  works for 99.9% of pg_proc entries when we add a new pg_proc column,
  because slipping a value into a desired position is gonna be hard for
  a script when you don't know whether the adjacent existing fields are
  present or not.
 
 I wonder if we should have a tool in our repository to help people
 edit the file.  So instead of going in there yourself and changing
 things by hand, or writing your own script, you can do:
 
 updatepgproc.pl --oid 5678 provolatile=v
 
 or
 
 updatepgproc.pl --name='.*xact.*' prowhatever=someval
 
 Regardless of what format we end up with, that seems like it would
 make things easier.

Alright, I'll bite on this: we have this really neat tool for editing
data in bulk, or individual values, or pulling out data to look at based
on particular values or even functions...  It's called PostgreSQL.

What if we had an easy way to export an existing table into whatever
format we decide to use for initdb to use?  For that matter, what if
that file was simple to import into PG?

What about having a way to load all the catalog tables from their git
repo files into a pg_dev schema?  Maybe even include a make target or
initdb option which does that?  (the point here being to provide a way
to modify the tables and compare the results to the existing tables
without breaking the instance one is using for this)

I have to admit that I've never tried to do that with the existing
format, but seems like an interesting idea to consider.  I further
wonder if it'd be possible to generate the table structures too..
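
As a very rough sketch of the export half, assuming DBD::Pg is
available and the hash-per-row format discussed upthread (the column
list and connection string are just for illustration):

#!/usr/bin/perl
# Dump a few pg_proc columns from a running server as editable Perl data.
use strict;
use warnings;
use DBI;
use Data::Dumper;

my $dbh = DBI->connect('dbi:Pg:dbname=postgres', '', '',
                       { RaiseError => 1 });

my $rows = $dbh->selectall_arrayref(
    'SELECT oid, proname, provolatile FROM pg_proc ORDER BY oid',
    { Slice => {} });

$Data::Dumper::Sortkeys = 1;
print Dumper($rows);    # hand-editable, and re-loadable as Perl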

Thanks!

Stephen




Re: [HACKERS] Bootstrap DATA is a pita

2015-03-04 Thread Andrew Dunstan


On 03/04/2015 09:51 AM, Robert Haas wrote:

On Wed, Mar 4, 2015 at 9:06 AM, Peter Eisentraut pete...@gmx.net wrote:

and make it harder to compare entries by grepping out some common
substring.

Could you give an example of the sort of thing you wish to do?

e.g. grep for a function name and check that all the matches have the
same volatility.



I think grep will be the wrong tool for this format, but if we're 
settling on a perl format, a few perl one-liners should be able to work 
pretty well. It might be worth shipping a small perl module that 
provided some functions, or a script doing common tasks (or both).
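
Something on the order of this might be enough to start with (the
module and function names are made up, and it assumes the data files
are plain Perl returning an arrayref of row hashes):

package CatalogData;    # hypothetical helper module
use strict;
use warnings;
use Exporter 'import';
our @EXPORT_OK = qw(load_rows find_rows);

# Load a data file written as plain Perl returning an arrayref of rows.
sub load_rows {
    my ($path) = @_;
    my $rows = do $path or die "cannot load $path: $@";
    return $rows;
}

# Return the rows whose fields equal all of the given key/value pairs.
sub find_rows {
    my ($rows, %match) = @_;
    return grep {
        my $row = $_;
        !grep { ($row->{$_} // '') ne $match{$_} } keys %match;
    } @$rows;
}

1;

A check like the volatility question above would then reduce to
load_rows() plus a few lines over its result.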


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-03-03 Thread Robert Haas
On Sat, Feb 21, 2015 at 11:34 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
 Andres Freund <and...@2ndquadrant.com> writes:
 On 2015-02-20 22:19:54 -0500, Peter Eisentraut wrote:
 On 2/20/15 8:46 PM, Josh Berkus wrote:
 Or what about just doing CSV?

 I don't think that would actually address the problems.  It would just
 be the same format as now with different delimiters.

 Yea, we need hierarchies and named keys.

 Yeah.  One thought though is that I don't think we need the data layer
 in your proposal; that is, I'd flatten the representation to something
 more like

  {
  oid => 2249,
  oiddefine => 'CSTRINGOID',
  typname => 'cstring',
  typlen => -2,
  typbyval => 1,
  ...
  }

Even this promises to vastly increase the number of lines in the file,
and make it harder to compare entries by grepping out some common
substring.  I agree that the current format is a pain in the tail, but
pg_proc.h is 5k lines already.  I don't want it to be 100k lines
instead.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-03-02 Thread Greg Stark
On Sat, Feb 21, 2015 at 11:08 PM, Andres Freund <and...@2ndquadrant.com> wrote:
 The changes in pg_proc.h are just to demonstrate that using names
 instead of oids works.

Fwiw I always thought it was strange how much of our bootstrap was
done in a large static text file. Very little of it is actually needed
for bootstrapping and we could get by with a very small set followed
by a bootstrap script written in standard SQL, not unlike how the
system views are created. It's much easier to type CREATE OPERATOR and
CREATE OPERATOR CLASS with all the symbolic names instead of having to
fill in the table.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-02-21 Thread Andres Freund
On 2015-02-21 17:43:09 +0100, Andres Freund wrote:
 One thing I was considering was to do the regtype and regproc lookups
 directly in the tool. That'd have two advantages: 1) it'd make it
 possible to refer to typenames in pg_proc, 2) It'd be much faster. Right
 now most of initdb's time is doing syscache lookups during bootstrap,
 because it can't use indexes... A simple hash lookup during bki
 generation could lead to quite measurable savings during lookup.

I've *very* quickly hacked this up. Doing this for all regproc columns
gives a consistent speedup in an assert-enabled build from ~0m3.589s to
~0m2.544s. My guess is that the relative speedup in optimized mode would
actually be even bigger as now most of the time is spent in
AtEOXact_CatCache.

Given that pg_proc is unlikely to get any smaller and that the current
code is essentially O(lookups * #pg_proc), this alone seems to be worth
a good bit.

The same trick should also allow us to simply refer to type names in
pg_proc et al. If we had a way to denote a column being of type
relnamespace/relauthid we could replace
$row->{bki_values} =~ s/\bPGUID\b/$BOOTSTRAP_SUPERUSERID/g;
$row->{bki_values} =~ s/\bPGNSP\b/$PG_CATALOG_NAMESPACE/g;
as well.

The changes in pg_proc.h are just to demonstrate that using names
instead of oids works.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 39e6d60969327575b4797186c4577df8edd21fa5 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Sun, 22 Feb 2015 00:06:18 +0100
Subject: [PATCH] WIP: resolve regtype/regproc in genbki.pl

Faster, and allows us to rely on them earlier.
---
 src/backend/catalog/Catalog.pm   | 15 ++
 src/backend/catalog/genbki.pl| 91 +---
 src/backend/utils/adt/regproc.c  | 82 +++-
 src/include/c.h  |  3 ++
 src/include/catalog/pg_proc.h| 30 +--
 src/test/regress/expected/opr_sanity.out |  8 +--
 6 files changed, 128 insertions(+), 101 deletions(-)

diff --git a/src/backend/catalog/Catalog.pm b/src/backend/catalog/Catalog.pm
index c7b1c17..64af70b 100644
--- a/src/backend/catalog/Catalog.pm
+++ b/src/backend/catalog/Catalog.pm
@@ -196,6 +196,21 @@ sub Catalogs
 }
 			}
 		}
+
+		# allow looking up columns by name
+		$catalog{columns_byname} = {};
+		my @columnnames;
+		my @columntypes;
+
+		foreach my $column (@{ $catalog{columns} })
+		{
+			$catalog{columns_byname}{$column->{'name'}} = $column;
+			push @columnnames, $column->{'name'};
+			push @columntypes, $column->{'type'};
+		}
+		$catalog{columnnames} = \@columnnames;
+		$catalog{columntypes} = \@columntypes;
+
 		$catalogs{$catname} = \%catalog;
 		close INPUT_FILE;
 	}
diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl
index a5c78ee..8c55fc1 100644
--- a/src/backend/catalog/genbki.pl
+++ b/src/backend/catalog/genbki.pl
@@ -104,6 +104,89 @@ my %schemapg_entries;
 my @tables_needing_macros;
 our @types;
 
+my %catalogs_by_name;
+
+# in a first pass, parse data and build some lookup tables
+foreach my $catname (@{ $catalogs->{names} })
+{
+my $catalog = $catalogs->{$catname};
+my %byname;
+my %byoid;
+my $name;
+
+# Column to use for lookup mapping
+if ($catname eq 'pg_type')
+{
+	$name = 'typname';
+}
+elsif ($catname eq 'pg_proc')
+{
+	$name = 'proname';
+}
+
+foreach my $row (@{ $catalog->{data} })
+{
+	my %valuesbyname;
+
+	# substitute constant values we acquired above
+	$row->{bki_values} =~ s/\bPGUID\b/$BOOTSTRAP_SUPERUSERID/g;
+	$row->{bki_values} =~ s/\bPGNSP\b/$PG_CATALOG_NAMESPACE/g;
+
+	# split data into actual columns
+	my @values = split /\s+/, $row->{bki_values};
+
+	# store values in a more useful format
+	$row->{values} = \@values;
+
+	# build lookup table if necessary
+	if ($name and defined $row->{oid})
+	{
+	@valuesbyname{ @{ $catalog->{columnnames} } } = @values;
+	$byname{$valuesbyname{$name}} = $row->{oid};
+	$byoid{$row->{oid}} = $valuesbyname{$name};
+	}
+}
+if (%byname)
+{
+	$catalog->{byname} = \%byname;
+	$catalog->{byoid} = \%byoid;
+}
+}
+
+# in a second pass, resolve references and similar things in the data
+foreach my $catname (@{ $catalogs->{names} })
+{
+my $catalog = $catalogs->{$catname};
+
+foreach my $row (@{ $catalog->{data} })
+{
+	my $colno = 0;
+	foreach my $column (@{ $catalog->{columns} })
+	{
+	my $value = $row->{values}->[$colno];
+
+	if ($column->{type} eq 'regproc')
+	{
+		if ($value ne '-' && $value !~ /^\d+$/)
+		{
+		my $replacement = $catalogs->{pg_proc}->{byname}->{$value};
+		$row->{values}->[$colno] = $replacement;
+		}
+	}
+	elsif ($column->{type} eq 'regtype')
+	{
+		if ($value ne '-' && $value !~ /^\d+$/)
+		{
+		my $replacement = $catalogs->{pg_type}->{byname}->{$value};
+		

Re: [HACKERS] Bootstrap DATA is a pita

2015-02-21 Thread Andres Freund
On 2015-02-20 22:19:54 -0500, Peter Eisentraut wrote:
 On 2/20/15 8:46 PM, Josh Berkus wrote:
  What about YAML?  That might have been added somewhat earlier.

 YAML isn't included in Perl, but there is

 Module::Build::YAML - Provides just enough YAML support so that
 Module::Build works even if YAML.pm is not installed

I'm afraid not:

sub Load {
    shift if ($_[0] eq __PACKAGE__ || ref($_[0]) eq __PACKAGE__);
    die "not yet implemented";
}

  Or what about just doing CSV?

 I don't think that would actually address the problems.  It would just
 be the same format as now with different delimiters.

Yea, we need hierarchies and named keys.

 One concern I have with this is that in my experience different tools
 and editors have vastly different ideas on how to format these kinds of
 nested structures.  I'd try out YAML, or even a homemade fake YAML over
 this.

Yes, that's a good point. I have zero desire to open-code a format
though, I think that's a bad idea. We could say we just include
YAML::Tiny, that's what it's made for.

To allow for changing things programmatically without noise I was
wondering whether we shouldn't just load/dump the file at some point in
the build process. Then we're sure the indentation is correct and it can
be changed programmatically without requiring manual fixup of comments.
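
A canonicalizing pass could be as small as this (sketch only; a naive
load/dump like this would of course drop free-form comments, so any
comments would have to live inside the data structure itself):

#!/usr/bin/perl
# Load a Perl data file and rewrite it in one canonical layout.
use strict;
use warnings;
use Data::Dumper;

my ($path) = @ARGV;
my $data = do $path or die "cannot load $path: $@";

$Data::Dumper::Sortkeys = 1;    # stable key order
$Data::Dumper::Indent   = 1;    # one key/value per line
$Data::Dumper::Terse    = 1;    # omit the leading '$VAR1 ='

open my $out, '>', "$path.tmp" or die $!;
print {$out} Dumper($data);
close $out or die $!;
rename "$path.tmp", $path or die $!;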

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-02-21 Thread Andrew Dunstan


On 02/21/2015 09:39 AM, Andrew Dunstan wrote:


On 02/21/2015 05:04 AM, Andres Freund wrote:


Yes, that's a good point. I have zero desire to open-code a format
though, I think that's a bad idea. We could say we just include
YAML::Tiny, that's what it's made for.




Personally, I think I would prefer that we use JSON (and yes, there's 
a JSON::Tiny module, which definitely lives up to its name).


For one thing, we've made a feature of supporting JSON, so arguably we 
should eat the same dog food.


I also dislike YAML's line-oriented format. I'd like to be able to add 
a pg_proc entry in a handful of lines instead of 29 or more (pg_proc 
has 27 attributes, but some of them are arrays, and there's an oid and 
in most cases a description to add as well). We could reduce that 
number by defaulting some of the attributes (pronamespace, proowner 
and prolang, for example) and possibly inferring others (pronargs?). 
Even so it's going to take up lots of lines of vertical screen real 
estate. A JSON format could be more vertically compact. The price for 
that is that JSON strings have to be quoted, which I know lots of 
people hate.





Followup:

The YAML spec does support explicit flows like JSON, which would 
overcome my objections above, but unfortunately these are not supported 
by YAML::Tiny.



cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-02-21 Thread Andres Freund
On 2015-02-21 11:34:09 -0500, Tom Lane wrote:
 Andres Freund <and...@2ndquadrant.com> writes:
  On 2015-02-20 22:19:54 -0500, Peter Eisentraut wrote:
  On 2/20/15 8:46 PM, Josh Berkus wrote:
  Or what about just doing CSV?
 
  I don't think that would actually address the problems.  It would just
  be the same format as now with different delimiters.
 
  Yea, we need hierarchies and named keys.
 
 Yeah.  One thought though is that I don't think we need the data layer
 in your proposal; that is, I'd flatten the representation to something
 more like
 
 {
 oid => 2249,
 oiddefine => 'CSTRINGOID',
 typname => 'cstring',
 typlen => -2,
 typbyval => 1,
 ...
 }

I don't really like that - then stuff like oid, description, comment (?)
have to not conflict with any catalog columns. I think it's easier to
have them separate.

 This will be easier to edit, either manually or programmatically I think.
 The code that turns it into a .bki file will need to know the exact set
 of columns in each system catalog, but it would have had to know that
 anyway I believe, if you're expecting it to insert default values.

There'll need to be some awareness of columns, sure. But I think
programmatically editing the values will still be simpler if you don't
need to discern whether a key is a column or some genbki-specific value.

 Ideally the column defaults could come from BKI_ macros in the catalog/*.h
 files; it would be good if we could keep those files as the One Source of
 Truth for catalog schema info, even as we split out the initial data.

Hm, yea.

One thing I was considering was to do the regtype and regproc lookups
directly in the tool. That'd have two advantages: 1) it'd make it
possible to refer to typenames in pg_proc, 2) It'd be much faster. Right
now most of initdb's time is doing syscache lookups during bootstrap,
because it can't use indexes... A simple hash lookup during bki
generation could lead to quite measurable savings during lookup.

We could then even rip the bootstrap code out of regtypein/regprocin...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-02-21 Thread Tom Lane
Andrew Dunstan <and...@dunslane.net> writes:
 On 02/21/2015 09:39 AM, Andrew Dunstan wrote:
 Personally, I think I would prefer that we use JSON (and yes, there's 
 a JSON::Tiny module, which definitely lives up to its name).
 For one thing, we've made a feature of supporting JSON, so arguably we 
 should eat the same dog food.

We've also made a feature of supporting XML, and a lot earlier, so that
argument seems pretty weak.

My only real requirement on the format choice is that it should absolutely
not require any Perl module that's not in a bog-standard installation.
I've gotten the buildfarm code running on several ancient machines now and
in most cases getting the module dependencies dealt with was pure hell.
No non-core modules for a basic build please.  I don't care whether they
are tiny.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-02-21 Thread Tom Lane
Andres Freund <and...@2ndquadrant.com> writes:
 On 2015-02-20 22:19:54 -0500, Peter Eisentraut wrote:
 On 2/20/15 8:46 PM, Josh Berkus wrote:
 Or what about just doing CSV?

 I don't think that would actually address the problems.  It would just
 be the same format as now with different delimiters.

 Yea, we need hierarchies and named keys.

Yeah.  One thought though is that I don't think we need the data layer
in your proposal; that is, I'd flatten the representation to something
more like

 {
 oid => 2249,
 oiddefine => 'CSTRINGOID',
 typname => 'cstring',
 typlen => -2,
 typbyval => 1,
 ...
 }

This will be easier to edit, either manually or programmatically I think.
The code that turns it into a .bki file will need to know the exact set
of columns in each system catalog, but it would have had to know that
anyway I believe, if you're expecting it to insert default values.

Ideally the column defaults could come from BKI_ macros in the catalog/*.h
files; it would be good if we could keep those files as the One Source of
Truth for catalog schema info, even as we split out the initial data.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-02-21 Thread Andrew Dunstan


On 02/21/2015 11:43 AM, Tom Lane wrote:

Andrew Dunstan <and...@dunslane.net> writes:

On 02/21/2015 09:39 AM, Andrew Dunstan wrote:

Personally, I think I would prefer that we use JSON (and yes, there's
a JSON::Tiny module, which definitely lives up to its name).
For one thing, we've made a feature of supporting JSON, so arguably we
should eat the same dog food.

We've also made a feature of supporting XML, and a lot earlier, so that
argument seems pretty weak.



Fair enough



My only real requirement on the format choice is that it should absolutely
not require any Perl module that's not in a bog-standard installation.
I've gotten the buildfarm code running on several ancient machines now and
in most cases getting the module dependencies dealt with was pure hell.
No non-core modules for a basic build please.  I don't care whether they
are tiny.




The point about using the tiny modules is that they are so small and 
self-contained they can either be reasonably shipped with our code or 
embedded directly in the script that uses them, so no extra build 
dependency would be created.


However, I rather like your suggestion of this:


  {
  oid => 2249,
  oiddefine => 'CSTRINGOID',
  typname => 'cstring',
  typlen => -2,
  typbyval => 1,
  ...
  }



which is pure perl syntax and wouldn't need any extra module, and has 
the advantage over JSON that key names won't need to be quoted, making 
it more readable.



cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-02-21 Thread Andrew Dunstan


On 02/21/2015 05:04 AM, Andres Freund wrote:


Yes, that's a good point. I have zero desire to open-code a format
though, I think that's a bad idea. We could say we just include
YAML::Tiny, that's what it's made for.




Personally, I think I would prefer that we use JSON (and yes, there's a 
JSON::Tiny module, which definitely lives up to its name).


For one thing, we've made a feature of supporting JSON, so arguably we 
should eat the same dog food.


I also dislike YAML's line-oriented format. I'd like to be able to add a 
pg_proc entry in a handful of lines instead of 29 or more (pg_proc has 
27 attributes, but some of them are arrays, and there's an oid and in 
most cases a description to add as well). We could reduce that number by 
defaulting some of the attributes (pronamespace, proowner and prolang, 
for example) and possibly inferring others (pronargs?). Even so it's 
going to take up lots of lines of vertical screen real estate. A JSON 
format could be more vertically compact. The price for that is that JSON 
strings have to be quoted, which I know lots of people hate.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-02-21 Thread Andres Freund
On February 21, 2015 7:20:04 PM CET, Andrew Dunstan <and...@dunslane.net> wrote:

On 02/21/2015 11:43 AM, Tom Lane wrote:

   {
   oid => 2249,
   oiddefine => 'CSTRINGOID',
   typname => 'cstring',
   typlen => -2,
   typbyval => 1,
   ...
   }


which is pure perl syntax and wouldn't need any extra module, and has 
the advantage over JSON that key names won't need to be quoted, making 
it more readable.

Yea, my original post suggested using actual perl hashes to avoid problems with 
the availability of libraries. So far I've not really heard a convincing 
alternative. Peter's problem with formatting seems to be most easily solved by 
rewriting the file automatically...

Andres


-- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-02-20 Thread Josh Berkus
On 02/20/2015 03:41 PM, Andres Freund wrote:
 What I think we should do is to add pg_<catalog>.data files that contain
 the actual data that are automatically parsed by Catalog.pm. Those
 contain the rows in some to-be-decided format. I was considering using
 json, but it turns out only perl 5.14 started shipping JSON::PP as part
 of the standard library. So I guess it's best we just make it a big perl
 array + hashes.

What about YAML?  That might have been added somewhat earlier.

Or what about just doing CSV?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-02-20 Thread Petr Jelinek

On 21/02/15 04:22, Peter Eisentraut wrote:

I violently support this proposal.


Maybe something roughly like:

# pg_type.data
CatalogData(
 'pg_type',
 [
  {
  oid => 2249,
  data => {typname => 'cstring', typlen => -2, typbyval => 1, fake => '...'},
  oiddefine => 'CSTRINGOID'
  }
 ]
);


One concern I have with this is that in my experience different tools
and editors have vastly different ideas on how to format these kinds of
nested structures.  I'd try out YAML, or even a homemade fake YAML over
this.



+1 for the idea and +1 for YAML(-like) syntax.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-02-20 Thread Peter Eisentraut
On 2/20/15 8:46 PM, Josh Berkus wrote:
 What about YAML?  That might have been added somewhat earlier.

YAML isn't included in Perl, but there is

Module::Build::YAML - Provides just enough YAML support so that
Module::Build works even if YAML.pm is not installed

which might work.

 Or what about just doing CSV?

I don't think that would actually address the problems.  It would just
be the same format as now with different delimiters.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-02-20 Thread Peter Eisentraut
I violently support this proposal.

 Maybe something roughly like:
 
 # pg_type.data
 CatalogData(
 'pg_type',
 [
  {
  oid => 2249,
  data => {typname => 'cstring', typlen => -2, typbyval => 1, fake => '...'},
  oiddefine => 'CSTRINGOID'
  }
 ]
 );

One concern I have with this is that in my experience different tools
and editors have vastly different ideas on how to format these kinds of
nested structures.  I'd try out YAML, or even a homemade fake YAML over
this.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Bootstrap DATA is a pita

2015-02-20 Thread Andres Freund
Hi,

I've for a long while been rather annoyed about how cumbersome it
is to add catalog rows using the bootstrap format. Especially pg_proc.h,
pg_operator.h, pg_amop.h, pg_amproc.h and some more are really unwieldy.

I think this needs to be improved. And while I'm not going to start
working on it tonight, I do plan to work on it if we can agree on a
design that I think is worth implementing.

The things that bug me most are:

1) When adding new rows it's rather hard to know which columns are which,
   and you have to specify a lot you really don't care about. Especially
   in pg_proc that's rather annoying.

2) Having to assign oids for many things that don't actually need one is
   bothersome and greatly increases the likelihood of conflicts. There's
   some rows for which we need fixed oids (pg_type ones for example),
   but e.g. for the majority of pg_proc it's unnecessary.

3) Adding a new column to a system catalog, especially pg_proc.h,
   basically requires writing a complex regex or program to modify the
   header.

Therefore I propose that we add another format to generate the .bki
insert lines.

What I think we should do is to add pg_<catalog>.data files that contain
the actual data that are automatically parsed by Catalog.pm. Those
contain the rows in some to-be-decided format. I was considering using
json, but it turns out only perl 5.14 started shipping JSON::PP as part
of the standard library. So I guess it's best we just make it a big perl
array + hashes.

To address 1) we just need to make each row a hash and allow leaving out
columns that have some default value.

2) is a bit more complex. Generally many rows don't need a fixed oid at
all and many others primarily need it to handle object descriptions. The
latter seems best solved by not making it dependent on the oid
anymore.

3) Seems primarily solved by not requiring default values to be
specified anymore. Also it should be much easier to add new values
automatically to a parseable format.

I think we'll need to generate oid #defines for some catalog contents,
but that seems solvable.

Maybe something roughly like:

# pg_type.data
CatalogData(
'pg_type',
[
 {
 oid => 2249,
 data => {typname => 'cstring', typlen => -2, typbyval => 1, fake => '...'},
 oiddefine => 'CSTRINGOID'
 }
]
);

# pg_proc.data
CatalogData(
'pg_proc',
[
 {
 oid => 1242,
 data => {proname => 'boolin', prorettype => 16, proargtypes => [2275],
          provolatile => 'i'},
 description => 'I/O',
 },
 {
 data => {proname => 'mode_final', prorettype => 2283,
          proargtypes => [2281, 2283]},
 description => 'aggregate final function',
 }
]
);

There'd need to be some logic to assign default values for columns, and
maybe even simple logic e.g. to determine arguments like pronargs based
on proargtypes.
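
That defaulting/inference step could look roughly like this (the
default values shown are made up, purely to illustrate the shape):

# Sketch: fill in defaulted columns and derive pronargs.
my %pg_proc_defaults = (
    pronamespace => 'PGNSP',
    proowner     => 'PGUID',
    provolatile  => 'i',
);

sub complete_pg_proc_row {
    my ($row) = @_;
    for my $col (keys %pg_proc_defaults) {
        $row->{$col} = $pg_proc_defaults{$col}
            unless defined $row->{$col};
    }
    # pronargs follows from proargtypes, so it need not be specified.
    $row->{pronargs} //= scalar @{ $row->{proargtypes} // [] };
    return $row;
}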

This is far from fully thought through, but I think something very
roughly along these lines could be a remarkable improvement in the ease
of adding new catalog contents.

Comments?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers