Re: Password leakage avoidance
Scratch that; rather than memset(...) it should be explicit_bzero(...) so it doesn't get optimized out. Same idea though. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: Password leakage avoidance
On Sat, Jan 6, 2024 at 12:39 PM Joe Conway wrote: > The only code specific comments were Tom's above, which have been > addressed. If there are no serious objections I plan to commit this > relatively soon. > One more thing that we do in pgjdbc is to zero out the input password args so that they don't remain in memory even after being freed. It's kind of odd in Java as it makes the input interface a char[] and we have to convert them to garbage collected Strings internally (which kind of defeats the purpose of the exercise). But in libpq it could be done via something like: memset(pw1, 0, strlen(pw1)); memset(pw2, 0, strlen(pw2)); There was some debate on our end of where to do that and we settled on doing it inside the encoding functions to ensure it always happens. So the input password char[] always gets wiped regardless of how the encoding functions are invoked. Even if it's not added to the password encoding functions (as that kind of changes the after effects if anything was relying on the buffer still containing the password), I think it'd be good to add it to the command.c stuff that has the two copies of the password prior to freeing them. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: Password leakage avoidance
On Sat, Jan 6, 2024 at 11:53 AM Joe Conway wrote: > On 1/2/24 07:23, Sehrope Sarkuni wrote: > > 1. There's two sets of defaults, the client program's default and the > > server's default. Need to pick one for each implemented function. They > > don't need to be the same across the board. > > Is there a concrete recommendation here? > Between writing that and now, we scrapped the "driver picks" variant of this. Now we only have explicit functions for each of the encoding types (i.e. one for md5 and one for SCRAM-SHA-256) and an alterUserPassword(...) method that uses the default for the database via reading the password_encryption GUC. We also have some javadoc comments on the encoding functions to strongly suggest using the SCRAM functions and to only use md5 directly for legacy servers. The "driver picks" one was removed to prevent a situation where an end user picks the driver default and it's not compatible with their server. The rationale was if the driver's SCRAM-SHA-256 default is ever replaced with something else (e.g. SCRAM-SHA-512) we'd end up with an interim state where an upgraded driver application would try to use that newer encryption method on an old DB. If a user is going to do that, they would have to be explicit with their choice of encoding functions (hence removing the "driver picks" variant). So the recommendation is to have explicit functions for each variant and have the end-to-end change password code read from the DB. My understanding of this patch is that it does exactly that. > > 2. Password encoding should be split out and made available as its own > > functions. Not just as part of a wider "create _or_ alter a user's > > password" function attached to a connection. > > It already is split out in libpq[1], unless I don't understand you > correctly. > Sorry for the confusion. My original list wasn't drawing any specific contrasts with what libpq is doing. 
It was more of a summary of thoughts having just concluded implementing the same type of password change stuff in PGJDBC. From what I've seen in this patch, it either aligns with how we did things in PGJDBC or it's something that isn't as relevant in this context (e.g. generating the SQL text as a public function). Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: Password leakage avoidance
Having worked on and just about wrapped up the JDBC driver patch for this, couple thoughts: 1. There's two sets of defaults, the client program's default and the server's default. Need to pick one for each implemented function. They don't need to be the same across the board. 2. Password encoding should be split out and made available as its own functions. Not just as part of a wider "create _or_ alter a user's password" function attached to a connection. We went a step further and added an intermediate function that generates the "ALTER USER ... PASSWORD" SQL. 3. We only added an "ALTER USER ... PASSWORD" function, not anything to create a user. There's way too many options for that and keeping this targeted at just assigning passwords makes it much easier to test. 4. RE:defaults, the PGJDBC approach is that the encoding-only function uses the driver's default (SCRAM). The "alterUserPassword(...)" uses the server's default (again usually SCRAM for modern installs but could be something else). It's kind of odd that they're different but the use cases are different as well. 5. Our SCRAM specific function allows for customizing the algo iteration and salt parameters. That topic came up on hackers previously[1]. Our high level "alterUserPassword(...)" function does not have those options but it is available as part of our PasswordUtil SCRAM API for advanced users who want to leverage it. The higher level functions have defaults for iteration counts (4096) and salt size (16-bytes). 6. Getting the verbiage right for the PGJDBC version was kind of annoying as we wanted to match the server's wording, e.g. "password_encryption", but it's clearly hashing, not encryption. We settled on "password encoding" for describing the overall task with the comments referencing the server's usage of the term "password_encryption". Found a similar topic[2] on changing that recently as well but looks like that's not going anywhere. 
[1]: https://www.postgresql.org/message-id/1d669d97-86b3-a5dc-9f02-c368bca911f6%40iki.fi [2]: https://www.postgresql.org/message-id/flat/ZV149Fd2JG_OF7CM%40momjian.us#cc97d20ff357a9e9264d4ae14e96e566 Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: Emitting JSON to file using COPY TO
On Wed, Dec 6, 2023 at 4:29 PM Joe Conway wrote: > > 1. Outputting a top level JSON object without the additional column > > keys. IIUC, the top level keys are always the column names. A common use > > case would be a single json/jsonb column that is already formatted > > exactly as the user would like for output. Rather than enveloping it in > > an object with a dedicated key, it would be nice to be able to output it > > directly. This would allow non-object results to be outputted as well > > (e.g., lines of JSON arrays, numbers, or strings). Due to how JSON is > > structured, I think this would play nice with the JSON lines vs. array > > concept. > > > > COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM > > generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, > > SOME_OPTION_TO_NOT_ENVELOPE) > > {"foo":1} > > {"foo":2} > > {"foo":3} > > Your example does not match what you describe, or do I misunderstand? I > thought your goal was to eliminate the repeated "foo" from each row... > The "foo" in this case is explicit as I'm adding it when building the object. What I was trying to show was not adding an additional object wrapper / envelope. So each row is:

{"foo":1}

Rather than:

{"json_build_object":{"foo":1}}

If each row has exactly one json / jsonb field, then the user has already indicated the format for each row. That same mechanism can be used to remove the "foo" entirely via a json/jsonb array. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: Emitting JSON to file using COPY TO
On Wed, Dec 6, 2023 at 4:03 PM Andrew Dunstan wrote: > > The output size difference does say that maybe we should pay some > > attention to the nearby request to not always label every field. > > Perhaps there should be an option for each row to transform to > > a JSON array rather than an object? > > I doubt it. People who want this are likely to want pretty much what > this patch is providing, not something they would have to transform in > order to get it. If they want space-efficient data they won't really be > wanting JSON. Maybe they want Protocol Buffers or something in that vein. > For arrays vs. objects, it's not just about data size. There are plenty of situations where a JSON array is superior to an object (e.g. duplicate column names). Lines of JSON arrays of strings are pretty much CSV with JSON escaping rules and a pair of wrapping brackets. It's common for tabular data in node.js environments as you don't need a separate CSV parser. Each one has its place and a default of the row_to_json(...) representation of the row still makes sense. But if the user has the option of outputting a single json/jsonb field for each row without an object or array wrapper, then it's possible to support all of these use cases as the user can explicitly pick whatever envelope makes sense:

-- Lines of JSON arrays:
COPY (SELECT json_build_array('test-' || a, b) FROM generate_series(1, 3) a, generate_series(5,6) b) TO STDOUT WITH (FORMAT JSON, SOME_OPTION_TO_DISABLE_ENVELOPE);
["test-1", 5]
["test-2", 5]
["test-3", 5]
["test-1", 6]
["test-2", 6]
["test-3", 6]

-- Lines of JSON strings:
COPY (SELECT to_json('test-' || x) FROM generate_series(1, 5) x) TO STDOUT WITH (FORMAT JSON, SOME_OPTION_TO_DISABLE_ENVELOPE);
"test-1"
"test-2"
"test-3"
"test-4"
"test-5"

I'm not sure how I feel about the behavior being automatic if it's a single top level json / jsonb field rather than requiring the explicit option. 
It's probably what a user would want but it also feels odd to change the output wrapper automatically based on the fields in the response. If it is automatic and the user wants the additional envelope, the option always exists to wrap it further in another: json_build_object('some_field', my_field_i_want_wrapped)

The duplicate field names would be a good test case too. I haven't gone through this patch but I'm guessing it doesn't filter out duplicates so the behavior would match up with row_to_json(...), i.e. duplicates are preserved:

=> SELECT row_to_json(t.*) FROM (SELECT 1 AS a, 2 AS a) t;
  row_to_json
---------------
 {"a":1,"a":2}

If so, that's a good test case to add as however that's handled should be deterministic. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: Emitting JSON to file using COPY TO
Big +1 to this overall feature. This is something I've wanted for a long time as well. While it's possible to use a COPY with text output for a trivial case, the double escaping falls apart quickly for arbitrary data. It's really only usable when you know exactly what you are querying and know it will not be a problem. Regarding the defaults for the output, I think JSON lines (rather than a JSON array of objects) would be preferred. It's more natural to combine them and generate that type of data on the fly rather than forcing aggregation into a single object. A couple more features / use cases come to mind as well. Even if they're not part of a first round of this feature I think it'd be helpful to document them now as it might give some ideas for what does make that first cut:

1. Outputting a top level JSON object without the additional column keys. IIUC, the top level keys are always the column names. A common use case would be a single json/jsonb column that is already formatted exactly as the user would like for output. Rather than enveloping it in an object with a dedicated key, it would be nice to be able to output it directly. This would allow non-object results to be outputted as well (e.g., lines of JSON arrays, numbers, or strings). Due to how JSON is structured, I think this would play nice with the JSON lines vs. array concept.

COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, SOME_OPTION_TO_NOT_ENVELOPE)
{"foo":1}
{"foo":2}
{"foo":3}

2. An option to ignore null fields so they are excluded from the output. This would not be a default but would allow shrinking the total size of the output data in many situations. This would be recursive to allow nested objects to be shrunk down (not just the top level). This might be worthwhile as a standalone JSON function though handling it during output would be more efficient as it'd only be read once. 
COPY (SELECT json_build_object('foo', CASE WHEN x > 1 THEN x END) FROM generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, SOME_OPTION_TO_NOT_ENVELOPE, JSON_SKIP_NULLS)
{}
{"foo":2}
{"foo":3}

3. Reverse of #2 when copying data in to allow defaulting missing fields to NULL.

Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: PG 16 draft release notes ready
The intro in "E.1.3. Changes" says "... between PostgreSQL 15 and the previous major release". That should be "... between PostgreSQL __16__ ..." right? Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: Proposal to provide the facility to set binary format output for specific OID's per session
Idea here makes sense and I've seen this brought up repeatedly on the JDBC lists. Does the driver need to be aware that this SET command was executed? I'm wondering what happens if an end user executes this with an OID the driver does not actually know how to handle. > + Oid *tmpOids = palloc(length+1); > ... > + tmpOids = repalloc(tmpOids, length+1); These should be: sizeof(Oid) * (length + 1) Also, I think you need to specify an explicit context via MemoryContextAlloc or the allocated memory will be in the default context and released at the end of the command. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: Add jsonlog log_destination for JSON server logs
On Thu, Sep 16, 2021 at 9:36 PM Michael Paquier wrote: > I am not really impressed by 0001 and the introduction of > LOG_DESTINATIONS_WITH_FILES. So I would leave that out and keep the > list of destinations listed instead. And we are talking about two > places here, only within syslogger.c. > I've taken that out for now. The idea was to simplify the additions when jsonlog is added but can circle back to that later if it makes sense. > 0002 looks like an improvement, Nice. That's left unchanged (renamed to 0001 now). > but I think that 0003 makes the > readability of the code worse with the introduction of eight static > routines to handle all those cases. > > file_log_dest_should_rotate_for_size(), file_log_dest_close(), > file_log_dest_check_rotate_for_size(), get_syslogger_file() don't > bring major improvements. On the contrary, > file_log_dest_write_metadata and file_log_dest_rotate seem to reduce a > bit the noise. > It was helpful to split them out while working on the patch but I see your point upon re-reading through the result. Attached version (renamed to 002) adds only three static functions for checking rotation size, performing the actual rotation, and closing. Also one other to handle generating the logfile names with a suffix to handle the pfree-ing (wrapping logfile_open(...)). The rest of the changes happen in place using the new structures. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. 
| https://www.jackdb.com/ From c85e15ffc6a8e310c10be4c85580980d04846bf2 Mon Sep 17 00:00:00 2001 From: Sehrope Sarkuni Date: Thu, 16 Sep 2021 15:57:00 -0400 Subject: [PATCH 1/2] Split out syslogger EXEC_BACKEND fd serialization and opening into helper functions --- src/backend/postmaster/syslogger.c | 109 - 1 file changed, 44 insertions(+), 65 deletions(-) diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c index bca3883572..48b4c48a18 100644 --- a/src/backend/postmaster/syslogger.c +++ b/src/backend/postmaster/syslogger.c @@ -730,9 +730,23 @@ SysLogger_Start(void) return 0; } - #ifdef EXEC_BACKEND +static long syslogger_get_fileno(FILE *file) +{ +#ifndef WIN32 + if (file != NULL) + return (long) fileno(file); + else + return -1; +#else /* WIN32 */ + if (file != NULL) + return (long) _get_osfhandle(_fileno(file)); + else + return 0; +#endif /* WIN32 */ +} + /* * syslogger_forkexec() - * @@ -751,34 +765,9 @@ syslogger_forkexec(void) av[ac++] = NULL; /* filled in by postmaster_forkexec */ /* static variables (those not passed by write_backend_variables) */ -#ifndef WIN32 - if (syslogFile != NULL) - snprintf(filenobuf, sizeof(filenobuf), "%d", - fileno(syslogFile)); - else - strcpy(filenobuf, "-1"); -#else /* WIN32 */ - if (syslogFile != NULL) - snprintf(filenobuf, sizeof(filenobuf), "%ld", - (long) _get_osfhandle(_fileno(syslogFile))); - else - strcpy(filenobuf, "0"); -#endif /* WIN32 */ + snprintf(filenobuf, sizeof(filenobuf), "%ld", syslogger_get_fileno(syslogFile)); av[ac++] = filenobuf; - -#ifndef WIN32 - if (csvlogFile != NULL) - snprintf(csvfilenobuf, sizeof(csvfilenobuf), "%d", - fileno(csvlogFile)); - else - strcpy(csvfilenobuf, "-1"); -#else /* WIN32 */ - if (csvlogFile != NULL) - snprintf(csvfilenobuf, sizeof(csvfilenobuf), "%ld", - (long) _get_osfhandle(_fileno(csvlogFile))); - else - strcpy(csvfilenobuf, "0"); -#endif /* WIN32 */ + snprintf(csvfilenobuf, sizeof(csvfilenobuf), "%ld", syslogger_get_fileno(csvlogFile)); 
av[ac++] = csvfilenobuf; av[ac] = NULL; @@ -787,6 +776,31 @@ syslogger_forkexec(void) return postmaster_forkexec(ac, av); } +static FILE* syslogger_fdopen(int fd) +{ + FILE *file = NULL; + +#ifndef WIN32 + if (fd != -1) + { + file = fdopen(fd, "a"); + setvbuf(file, NULL, PG_IOLBF, 0); + } +#else /* WIN32 */ + if (fd != 0) + { + fd = _open_osfhandle(fd, _O_APPEND | _O_TEXT); + if (fd > 0) + { + file = fdopen(fd, "a"); + setvbuf(file, NULL, PG_IOLBF, 0); + } + } +#endif /* WIN32 */ + + return file; +} + /* * syslogger_parseArgs() - * @@ -795,8 +809,6 @@ syslogger_forkexec(void) static void syslogger_parseArgs(int argc, char *argv[]) { - int fd; - Assert(argc == 5); argv += 3; @@ -807,41 +819,8 @@ syslogger_parseArgs(int argc, char *argv[]) * fails there's not a lot we can do to report the problem anyway. As * coded, we'll just crash on a null pointer dereference after failure... */ -#ifndef WIN32 - fd = atoi(*argv++); - if (fd != -1) - { - syslogFile = fdopen(fd, "a"); - setvbuf(syslogFile, NULL, PG_IOLBF, 0); - } - fd = atoi(*argv++); - if (fd != -1) - { - csvlogFile = fdopen(fd, "a"); - setvbuf(csvlogFile, NULL, PG_IOLBF, 0); - } -#else /*
Re: Add jsonlog log_destination for JSON server logs
The previous patches failed on the cfbot Appveyor Windows build. That also makes me question whether I did the EXEC_BACKEND testing correctly as it should have caught that compile error. I'm going to look into that. In the meantime the attached should correct that part of the build. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/ > From 0f7978566b7770465060cd497a8dd4102b313878 Mon Sep 17 00:00:00 2001 From: Sehrope Sarkuni Date: Thu, 16 Sep 2021 14:43:31 -0400 Subject: [PATCH 1/3] Add constant for list of log destinations that use files --- src/backend/postmaster/syslogger.c | 5 ++--- src/include/utils/elog.h | 3 +++ 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c index bca3883572..bc546af7ff 100644 --- a/src/backend/postmaster/syslogger.c +++ b/src/backend/postmaster/syslogger.c @@ -420,7 +420,7 @@ SysLoggerMain(int argc, char *argv[]) * was sent by pg_rotate_logfile() or "pg_ctl logrotate". 
*/ if (!time_based_rotation && size_rotation_for == 0) -size_rotation_for = LOG_DESTINATION_STDERR | LOG_DESTINATION_CSVLOG; +size_rotation_for = LOG_DESTINATIONS_WITH_FILES; logfile_rotate(time_based_rotation, size_rotation_for); } @@ -1465,8 +1465,7 @@ update_metainfo_datafile(void) FILE *fh; mode_t oumask; - if (!(Log_destination & LOG_DESTINATION_STDERR) && - !(Log_destination & LOG_DESTINATION_CSVLOG)) + if (!(Log_destination & LOG_DESTINATIONS_WITH_FILES)) { if (unlink(LOG_METAINFO_DATAFILE) < 0 && errno != ENOENT) ereport(LOG, diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index f53607e12e..bea8b93da6 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -437,6 +437,9 @@ extern bool syslog_split_messages; #define LOG_DESTINATION_EVENTLOG 4 #define LOG_DESTINATION_CSVLOG 8 +/* Log destinations with file handles */ +#define LOG_DESTINATIONS_WITH_FILES (LOG_DESTINATION_CSVLOG | LOG_DESTINATION_STDERR) + /* Other exported functions */ extern void DebugFileOpen(void); extern char *unpack_sql_state(int sql_state); -- 2.25.1 From a22760648dad3832869b8c67886826e7cd306e52 Mon Sep 17 00:00:00 2001 From: Sehrope Sarkuni Date: Thu, 16 Sep 2021 15:57:00 -0400 Subject: [PATCH 2/3] Split out syslogger EXEC_BACKEND fd serialization and opening into helper functions --- src/backend/postmaster/syslogger.c | 109 - 1 file changed, 44 insertions(+), 65 deletions(-) diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c index bc546af7ff..4f0477794e 100644 --- a/src/backend/postmaster/syslogger.c +++ b/src/backend/postmaster/syslogger.c @@ -730,9 +730,23 @@ SysLogger_Start(void) return 0; } - #ifdef EXEC_BACKEND +static long syslogger_get_fileno(FILE *file) +{ +#ifndef WIN32 + if (file != NULL) + return (long) fileno(file); + else + return -1; +#else /* WIN32 */ + if (file != NULL) + return (long) _get_osfhandle(_fileno(file)); + else + return 0; +#endif /* WIN32 */ +} + /* * syslogger_forkexec() - * @@ -751,34 
+765,9 @@ syslogger_forkexec(void) av[ac++] = NULL; /* filled in by postmaster_forkexec */ /* static variables (those not passed by write_backend_variables) */ -#ifndef WIN32 - if (syslogFile != NULL) - snprintf(filenobuf, sizeof(filenobuf), "%d", - fileno(syslogFile)); - else - strcpy(filenobuf, "-1"); -#else /* WIN32 */ - if (syslogFile != NULL) - snprintf(filenobuf, sizeof(filenobuf), "%ld", - (long) _get_osfhandle(_fileno(syslogFile))); - else - strcpy(filenobuf, "0"); -#endif /* WIN32 */ + snprintf(filenobuf, sizeof(filenobuf), "%ld", syslogger_get_fileno(syslogFile)); av[ac++] = filenobuf; - -#ifndef WIN32 - if (csvlogFile != NULL) - snprintf(csvfilenobuf, sizeof(csvfilenobuf), "%d", - fileno(csvlogFile)); - else - strcpy(csvfilenobuf, "-1"); -#else /* WIN32 */ - if (csvlogFile != NULL) - snprintf(csvfilenobuf, sizeof(csvfilenobuf), "%ld", - (long) _get_osfhandle(_fileno(csvlogFile))); - else - strcpy(csvfilenobuf, "0"); -#endif /* WIN32 */ + snprintf(csvfilenobuf, sizeof(csvfilenobuf), "%ld", syslogger_get_fileno(csvlogFile)); av[ac++] = csvfilenobuf; av[ac] = NULL; @@ -787,6 +776,31 @@ syslogger_forkexec(void) return postmaster_forkexec(ac, av); } +static FILE* syslogger_fdopen(int fd) +{ + FILE *file = NULL; + +#ifndef WIN32 + if (fd != -1) + { + file = fdopen(fd, "a"); + setvbuf(file, NULL, PG_IOLBF, 0); + } +#else /* WIN32 */ + if (fd != 0) + { + fd = _open_osfhandle(fd, _O_APPEND | _O_TEXT); + if (fd > 0) + { + file = fdopen(fd, "a"); + setvbuf(file, NULL, PG_IOLBF, 0); + } + } +#endif /* WIN32 */ + + return file; +} + /*
Re: Add jsonlog log_destination for JSON server logs
Attached three patches refactor the syslogger handling of file based destinations a bit ... and then a lot. First patch adds a new constant to represent both file based destinations. This should make it easier to ensure additional destinations are handled in "For all file destinations..." situations (e.g. when we add the jsonlog destination). Second patch refactors the file descriptor serialization and re-opening in EXEC_BACKEND forking. Previously the code was duplicated for both stderr and csvlog. Again, this should simplify adding new destinations as they'd just invoke the same helper. There's an existing comment about not handling failed opens in syslogger_parseArgs(...) and this patch doesn't fix that, but it does provide a single location to do so in the future. The third patch adds a new internal (to syslogger.c) structure, FileLogDestination, for file based log destinations and modifies the existing handling for syslogFile and csvlogFile to defer to a bunch of helper functions. It also renames "syslogFile" to "stderr_file_info". Working through this patch, it was initially confusing that the stderr log file was named "syslogFile", the C file is named syslogger.c, and there's an entirely separate syslog (the message logging API) destination. I think this clears that up a bit. The patches pass check-world on Linux. I haven't tested it on Windows but it does pass check-world with EXEC_BACKEND defined to try out the forking code paths. Definitely needs some review to ensure it's functionally correct for the different log files. I haven't tried splitting the csvlog code out or adding the new jsonlog atop this yet. There's enough changes here that I'd like to get this settled first. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. 
| https://www.jackdb.com/ From bd5a4ff0435c721de3e7eb9b9207d9e96d79baf4 Mon Sep 17 00:00:00 2001 From: Sehrope Sarkuni Date: Thu, 16 Sep 2021 14:43:31 -0400 Subject: [PATCH 1/3] Add constant for list of log destinations that use files --- src/backend/postmaster/syslogger.c | 5 ++--- src/include/utils/elog.h | 3 +++ 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c index bca3883572..bc546af7ff 100644 --- a/src/backend/postmaster/syslogger.c +++ b/src/backend/postmaster/syslogger.c @@ -420,7 +420,7 @@ SysLoggerMain(int argc, char *argv[]) * was sent by pg_rotate_logfile() or "pg_ctl logrotate". */ if (!time_based_rotation && size_rotation_for == 0) -size_rotation_for = LOG_DESTINATION_STDERR | LOG_DESTINATION_CSVLOG; +size_rotation_for = LOG_DESTINATIONS_WITH_FILES; logfile_rotate(time_based_rotation, size_rotation_for); } @@ -1465,8 +1465,7 @@ update_metainfo_datafile(void) FILE *fh; mode_t oumask; - if (!(Log_destination & LOG_DESTINATION_STDERR) && - !(Log_destination & LOG_DESTINATION_CSVLOG)) + if (!(Log_destination & LOG_DESTINATIONS_WITH_FILES)) { if (unlink(LOG_METAINFO_DATAFILE) < 0 && errno != ENOENT) ereport(LOG, diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index f53607e12e..bea8b93da6 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -437,6 +437,9 @@ extern bool syslog_split_messages; #define LOG_DESTINATION_EVENTLOG 4 #define LOG_DESTINATION_CSVLOG 8 +/* Log destinations with file handles */ +#define LOG_DESTINATIONS_WITH_FILES (LOG_DESTINATION_CSVLOG | LOG_DESTINATION_STDERR) + /* Other exported functions */ extern void DebugFileOpen(void); extern char *unpack_sql_state(int sql_state); -- 2.17.1 From 653ea6865ce81de89bec4fc41f3c0be2933e6b2b Mon Sep 17 00:00:00 2001 From: Sehrope Sarkuni Date: Thu, 16 Sep 2021 15:57:00 -0400 Subject: [PATCH 2/3] Split out syslogger EXEC_BACKEND fd serialization and opening into helper functions 
--- src/backend/postmaster/syslogger.c | 109 - 1 file changed, 44 insertions(+), 65 deletions(-) diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c index bc546af7ff..4f0477794e 100644 --- a/src/backend/postmaster/syslogger.c +++ b/src/backend/postmaster/syslogger.c @@ -730,9 +730,23 @@ SysLogger_Start(void) return 0; } - #ifdef EXEC_BACKEND +static long syslogger_get_fileno(FILE *file) +{ +#ifndef WIN32 + if (file != NULL) + return (long) fileno(file); + else + return -1; +#else /* WIN32 */ + if (file != NULL) + return (long) _get_osfhandle(_fileno(file)); + else + return 0; +#endif /* WIN32 */ +} + /* * syslogger_forkexec() - * @@ -751,34 +765,9 @@ syslogger_forkexec(void) av[ac++] = NULL; /* filled in by postmaster_forkexec */ /* static variables (those not passed by write_backend_variables) */ -#ifndef WIN32 - if (syslogFile != NULL) - snprintf(filenobuf, sizeof(filenobuf), "%d", - fileno(syslogFile)); - else - st
Re: Add jsonlog log_destination for JSON server logs
On Sun, Sep 12, 2021 at 10:22 PM Michael Paquier wrote: > On Fri, Sep 10, 2021 at 03:56:18PM +0900, Michael Paquier wrote: > > And this part leads me to the attached, where the addition of the JSON > > format would result in the addition of a couple of lines. > > Okay, I have worked through the first half of the patch set, and > applied the improved versions of 0001 (refactoring of the chunk > protocol) and 0002 (addition of the tests for csvlog). > Thanks. I finally got a chance to look through those changes. I like it. The popcount and pulling out the flags are much easier to follow than the old hard coded letters. > I have not looked in details at 0003 and 0004 yet. Still, here are > some comments after a quick scan. > > + * elog-internal.h > I'd rather avoid the hyphen, and use elog_internal.h. > > +#define FORMATTED_TS_LEN 128 > +extern char formatted_start_time[FORMATTED_TS_LEN]; > +extern char formatted_log_time[FORMATTED_TS_LEN]; > + > +void setup_formatted_log_time(void); > +void setup_formatted_start_time(void); > We could just use a static buffer in each one of those routines, and > return back a pointer to the caller. > +1 > + else if ((Log_destination & LOG_DESTINATION_JSONLOG) && > + (jsonlogFile == NULL || > +time_based_rotation || (size_rotation_for & > LOG_DESTINATION_JSONLOG))) > [...] > + /* Write to JSON log if enabled */ > + else if (Log_destination & LOG_DESTINATION_JSONLOG) > + { > Those bits in 0004 are wrong. They should be two "if" clauses. This > is leading to issues when setting log_destination to multiple values > with jsonlog in the set of values and logging_collector = on, and the > logs are not getting redirected properly to the .json file. We would > get the data for the .log and .csv files though. This issue also > happens with the original patch set applied on top of e757080. 
> I think that we should also be more careful with the way we free > StringInfoData in send_message_to_server_log(), or we are going to > finish with unwelcome leaks. > Good catch. Staring at that piece again, that's tricky as it tries to aggressively free the buffer before calling write_csvlog(...), which can't just be duplicated for additional destinations. I think we need to pull up the negative case (i.e. syslogger not available) before the other destinations and if it matches, free and exit early. Otherwise, free the buffer and call whatever destination routines are enabled. > The same coding pattern is now repeated three times in logfile_rotate(): > - Check if a logging type is enabled. > - Optionally open new file, with logfile_open(). > - Business with ENFILE and EMFILE. > - pfree() and save of the static FILE* and file name for each type. > I think that we have enough material for a wrapper routine that does > this work, where we pass down LOG_DESTINATION_* and pointers to the > static FILE* and the static last file names. That would have avoided > the bug I found above. > I started on a bit of this as well. There's so much overlap already between the syslog_ and csvlog code that I'm going to put that together first. Then the json addition should just fall into a couple of call sites. I'm thinking of adding an internal struct to house the FILE* and file names. Then all the opening, closing, and log rotation code can pass pointers to those and centralize the pfree() and NULL checks. > The rebased patch set, that has reworked the tests to be in line with > HEAD, also fails. They compile. > > Sehrope, could you adjust that? Yes, I'm looking to sync those up and address the above later this week. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: Add jsonlog log_destination for JSON server logs
Updated patch set is attached. This version splits out the existing csvlog code into its own file and centralizes the common helpers into a new elog-internal.h so that they're only included by the actual write_xyz sources. That makes the elog.c changes in the JSON logging patch minimal as all it's really doing is invoking the new write_jsonlog(...) function. It also adds those missing fields to the JSON logger output. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/ From d5b3f5fe44e91d35aefdd570758d5b2a9e9c1a36 Mon Sep 17 00:00:00 2001 From: Sehrope Sarkuni Date: Wed, 10 Jul 2019 10:02:31 -0400 Subject: [PATCH 1/4] Adds separate dest field to log protocol PipeProtoHeader Adds a separate dest field to PipeProtoHeader to store the log destination requested by the sending process. Also changes the is_last field to only store whether the chunk is the last one for a message rather than also including whether the destination is csvlog. --- src/backend/postmaster/syslogger.c | 15 ++- src/backend/utils/error/elog.c | 4 +++- src/include/postmaster/syslogger.h | 4 ++-- 3 files changed, 11 insertions(+), 12 deletions(-) diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c index cad43bdef2..edd8f33204 100644 --- a/src/backend/postmaster/syslogger.c +++ b/src/backend/postmaster/syslogger.c @@ -878,7 +878,6 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer) { char *cursor = logbuffer; int count = *bytes_in_logbuffer; - int dest = LOG_DESTINATION_STDERR; /* While we have enough for a header, process data... 
*/ while (count >= (int) (offsetof(PipeProtoHeader, data) + 1)) @@ -891,8 +890,9 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer) if (p.nuls[0] == '\0' && p.nuls[1] == '\0' && p.len > 0 && p.len <= PIPE_MAX_PAYLOAD && p.pid != 0 && - (p.is_last == 't' || p.is_last == 'f' || - p.is_last == 'T' || p.is_last == 'F')) + (p.is_last == 't' || p.is_last == 'f') && + (p.dest == LOG_DESTINATION_CSVLOG || + p.dest == LOG_DESTINATION_STDERR)) { List *buffer_list; ListCell *cell; @@ -906,9 +906,6 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer) if (count < chunklen) break; - dest = (p.is_last == 'T' || p.is_last == 'F') ? -LOG_DESTINATION_CSVLOG : LOG_DESTINATION_STDERR; - /* Locate any existing buffer for this source pid */ buffer_list = buffer_lists[p.pid % NBUFFER_LISTS]; foreach(cell, buffer_list) @@ -924,7 +921,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer) free_slot = buf; } - if (p.is_last == 'f' || p.is_last == 'F') + if (p.is_last == 'f') { /* * Save a complete non-final chunk in a per-pid buffer @@ -970,7 +967,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer) appendBinaryStringInfo(str, cursor + PIPE_HEADER_SIZE, p.len); - write_syslogger_file(str->data, str->len, dest); + write_syslogger_file(str->data, str->len, p.dest); /* Mark the buffer unused, and reclaim string storage */ existing_slot->pid = 0; pfree(str->data); @@ -979,7 +976,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer) { /* The whole message was one chunk, evidently. 
*/ write_syslogger_file(cursor + PIPE_HEADER_SIZE, p.len, - dest); + p.dest); } } diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index a3e1c59a82..cd13111708 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -3250,6 +3250,8 @@ write_pipe_chunks(char *data, int len, int dest) p.proto.nuls[0] = p.proto.nuls[1] = '\0'; p.proto.pid = MyProcPid; + p.proto.dest = (int32) dest; + p.proto.is_last = 'f'; /* write all but the last chunk */ while (len > PIPE_MAX_PAYLOAD) @@ -3264,7 +3266,7 @@ write_pipe_chunks(char *data, int len, int dest) } /* write the last chunk */ - p.proto.is_last = (dest == LOG_DESTINATION_CSVLOG ? 'T' : 't'); + p.proto.is_last = 't'; p.proto.len = len; memcpy(p.proto.data, data, len); rc = write(fd, , PIPE_HEADER_SIZE + len); diff --git a/src/include/postmaster/syslogger.h b/src/include/postmaster/syslogger.h index 1491eecb0f..41d026a474 100644 --- a/src/include/postmaster/syslogger.h +++ b/src/include/postmaster/syslogger.h @@ -46,8 +46,8 @@ typedef struct char nuls[2]; /* always \0\0 */ uint16 len; /* size of this chunk (counts data only) */ int32 pid; /* writer's pid */ - char is_last; /* last chunk of message? 't' or 'f' ('T' or - * 'F' for CSV case) */ + int32 dest; /* log destination */ + char is_last;/* last chunk of message? 't' or 'f'*/ char data[FLEXIBLE_ARRAY_MEMBER]; /* data payload starts here */ } PipeProtoHeader; -- 2.17.1 From dfb17c0b1804b
Re: Add jsonlog log_destination for JSON server logs
On Tue, Aug 31, 2021 at 8:43 PM Michael Paquier wrote:
> On Tue, Aug 31, 2021 at 11:34:56AM -0400, Sehrope Sarkuni wrote:
> > The second commit adds a TAP test for log_destination "csvlog". This was
> > done to both confirm that the previous change didn't break anything and
> > as a skeleton for the test in the next commit.
>
> +note "Before sleep";
> +usleep(100_000);
> +note "Before rotate";
> +$node->logrotate();
> +note "After rotate";
> +usleep(100_000);
>
> Do you really need a rotation of the log files here? Wouldn't it be
> better to grab the position of the current log file with a fixed log
> file name, and then slurp the file from this position with your
> expected output? That would make the test faster, as well.

Yes, that was intentional. I used the log rotation TAP test as a base and kept that piece in there so it verifies that the csv log files are actually rotated. Same for the json logs.

> Rather than making elog.c larger, I think that we should try to split
> that into more files. Why not refactoring out the CSV part first?
> You could just call that csvlog.c, then create a new jsonlog.c for the
> meat of the patch.

That's a good idea. I'll try that out.

> The list of fields is not up to date. At quick glance, you are
> missing:
> - backend type.
> - leader PID.
> - query ID.
> - Session start timestamp (?)

Thanks. I'll take a look.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Add jsonlog log_destination for JSON server logs
Hi, This patch adds a new log_destination, "jsonlog", that writes log entries as lines of JSON. It was originally started by David Fetter using the jsonlog module by Michael Paquier ( https://github.com/michaelpq/pg_plugins/blob/master/jsonlog/jsonlog.c) as a basis for how to serialize the log messages. Thanks to both of them because this wouldn't be possible without that starting point. The first commit splits out the destination in log pipe messages into its own field. Previously it would piggyback on the "is_last" field. This adds an int to the message size but makes the rest of the code easier to follow. The second commit adds a TAP test for log_destination "csvlog". This was done to both confirm that the previous change didn't break anything and as a skeleton for the test in the next commit. The third commit adds the new log_destination "jsonlog". The output format is one line per entry with the top level output being a JSON object keyed with the log fields. Newlines in the output fields are escaped as \n so the output file has exactly one line per log entry. It also includes a new test for verifying the JSON output with some basic regex checks (similar to the csvlog test). Here's a sample of what the log entries look like: {"timestamp":"2021-08-31 10:15:25.129 EDT","user":"sehrope","dbname":"postgres","pid":12012,"remote_host":"[local]","session_id":"612e397d.2eec","line_num":1,"ps":"idle","session_start":"2021-08-31 10:15:25 EDT","vxid":"3/2","txid":"0","error_severity":"LOG","application_name":" 006_jsonlog.pl","message":"statement: SELECT 1/0"} It builds and passes "make check-world" on Linux. It also includes code to handle Windows as well but I have not actually tried building it there. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. 
| https://www.jackdb.com/ From d5b3f5fe44e91d35aefdd570758d5b2a9e9c1a36 Mon Sep 17 00:00:00 2001 From: Sehrope Sarkuni Date: Wed, 10 Jul 2019 10:02:31 -0400 Subject: [PATCH 1/3] Adds separate dest field to log protocol PipeProtoHeader Adds a separate dest field to PipeProtoHeader to store the log destination requested by the sending process. Also changes the is_last field to only store whether the chunk is the last one for a message rather than also including whether the destination is csvlog. --- src/backend/postmaster/syslogger.c | 15 ++- src/backend/utils/error/elog.c | 4 +++- src/include/postmaster/syslogger.h | 4 ++-- 3 files changed, 11 insertions(+), 12 deletions(-) diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c index cad43bdef2..edd8f33204 100644 --- a/src/backend/postmaster/syslogger.c +++ b/src/backend/postmaster/syslogger.c @@ -878,7 +878,6 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer) { char *cursor = logbuffer; int count = *bytes_in_logbuffer; - int dest = LOG_DESTINATION_STDERR; /* While we have enough for a header, process data... */ while (count >= (int) (offsetof(PipeProtoHeader, data) + 1)) @@ -891,8 +890,9 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer) if (p.nuls[0] == '\0' && p.nuls[1] == '\0' && p.len > 0 && p.len <= PIPE_MAX_PAYLOAD && p.pid != 0 && - (p.is_last == 't' || p.is_last == 'f' || - p.is_last == 'T' || p.is_last == 'F')) + (p.is_last == 't' || p.is_last == 'f') && + (p.dest == LOG_DESTINATION_CSVLOG || + p.dest == LOG_DESTINATION_STDERR)) { List *buffer_list; ListCell *cell; @@ -906,9 +906,6 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer) if (count < chunklen) break; - dest = (p.is_last == 'T' || p.is_last == 'F') ? 
-LOG_DESTINATION_CSVLOG : LOG_DESTINATION_STDERR; - /* Locate any existing buffer for this source pid */ buffer_list = buffer_lists[p.pid % NBUFFER_LISTS]; foreach(cell, buffer_list) @@ -924,7 +921,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer) free_slot = buf; } - if (p.is_last == 'f' || p.is_last == 'F') + if (p.is_last == 'f') { /* * Save a complete non-final chunk in a per-pid buffer @@ -970,7 +967,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer) appendBinaryStringInfo(str, cursor + PIPE_HEADER_SIZE, p.len); - write_syslogger_file(str->data, str->len, dest); + write_syslogger_file(str->data, str->len, p.dest); /* Mark the buffer unused, and reclaim string storage */ existing_slot->pid = 0; pfree(str->data); @@ -979,7 +976,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_log
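Since each jsonlog entry is exactly one line of JSON (with embedded newlines escaped as \n), consuming the output from another tool is a line-at-a-time parse. A stdlib-only sketch in Python, using an abridged version of the sample entry shown earlier in this post:

```python
import json

# One physical line from a jsonlog file, abridged from the sample entry
# above. Newlines inside fields arrive escaped as \n, so every log entry
# is exactly one line and can be parsed independently.
line = ('{"timestamp":"2021-08-31 10:15:25.129 EDT","user":"sehrope",'
        '"dbname":"postgres","pid":12012,"session_id":"612e397d.2eec",'
        '"error_severity":"LOG","message":"statement: SELECT 1/0"}')

entry = json.loads(line)
```

A log shipper would apply the same json.loads per line of the file; no multi-line reassembly (as CSV quoting can require) is needed.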
Re: Proposal: More structured logging
On Tue, Aug 24, 2021 at 7:22 PM Michael Paquier wrote: > From a code perspective, and while on it, we could split a bit elog.c > and move the log entries generated for each format into their own > file. That would be cleaner for CSV and JSON. As a whole I don't > have an objection with moving the JSON format into core. If one > proposes a patch, feel free to reuse the code from the module I have > released. > I had a patch that does exactly this from a few years ago. It started off copying a bunch out of your json logging module and adds it as a new "jsonlog" destination. It needed some cleanup due to bit rot, but it now builds and works atop master. I'll post it in its own thread. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: Add SQL function for SHA1
On Tue, Jan 26, 2021 at 8:53 PM Michael Paquier wrote: > On Tue, Jan 26, 2021 at 10:38:43AM +0100, Daniel Gustafsson wrote: > > Agreed, and pgcrypto already allows for using sha1. > > > > It seems like any legitimate need for sha1 could be better served by an > > extension rather than supplying it in-core. > > Both of you telling the same thing is enough for me to discard this > new stuff. I'd like to refactor the code anyway as that's a nice > cleanup, and this would have the advantage to make people look at > cryptohashfuncs.c if introducing a new type. After sleeping about it, > I think that I would just make MD5 and SHA1 issue an elog(ERROR) if > the internal routine is taken in those cases, like in the attached. > The refactor patch looks good. It builds and passes make check. Thanks for the enum explanation too. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: Add SQL function for SHA1
+1 to adding a SHA1 SQL function. Even if it's deprecated, there's plenty of historical usage that I can see it being useful. Either way, the rest of the refactor can be improved a bit to perform a single palloc() and remove the memcpy(). Attached is a diff for cryptohashfuncs.c that does that by writing the digest final directly to the result. It also removes the digest length arg and determines it in the switch block. There's only one correct digest length for each type so there's no reason to give callers the option to give the wrong one. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/ diff --git a/src/backend/utils/adt/cryptohashfuncs.c b/src/backend/utils/adt/cryptohashfuncs.c index d99485f4c6..8dc695e80c 100644 --- a/src/backend/utils/adt/cryptohashfuncs.c +++ b/src/backend/utils/adt/cryptohashfuncs.c @@ -15,6 +15,7 @@ #include "common/cryptohash.h" #include "common/md5.h" +#include "common/sha1.h" #include "common/sha2.h" #include "utils/builtins.h" @@ -68,6 +69,77 @@ md5_bytea(PG_FUNCTION_ARGS) PG_RETURN_TEXT_P(cstring_to_text(hexsum)); } +/* + * Internal routine to compute a cryptohash with the given bytea input. 
+ */ +static inline bytea * +cryptohash_internal(pg_cryptohash_type type, bytea *input) +{ + const uint8 *data; + const char *typestr; + int digest_len; + size_t len; + pg_cryptohash_ctx *ctx; + bytea *result; + + switch (type) + { + case PG_MD5: + typestr = "MD5"; + digest_len = MD5_DIGEST_LENGTH; + break; + case PG_SHA1: + typestr = "SHA1"; + digest_len = SHA1_DIGEST_LENGTH; + break; + case PG_SHA224: + typestr = "SHA224"; + digest_len = PG_SHA224_DIGEST_LENGTH; + break; + case PG_SHA256: + typestr = "SHA256"; + digest_len = PG_SHA256_DIGEST_LENGTH; + break; + case PG_SHA384: + typestr = "SHA384"; + digest_len = PG_SHA384_DIGEST_LENGTH; + break; + case PG_SHA512: + typestr = "SHA512"; + digest_len = PG_SHA512_DIGEST_LENGTH; + break; + default: + elog(ERROR, "unsupported digest type %d", type); + } + + result = palloc0(digest_len + VARHDRSZ); + len = VARSIZE_ANY_EXHDR(input); + data = (unsigned char *) VARDATA_ANY(input); + + ctx = pg_cryptohash_create(type); + if (pg_cryptohash_init(ctx) < 0) + elog(ERROR, "could not initialize %s context", typestr); + if (pg_cryptohash_update(ctx, data, len) < 0) + elog(ERROR, "could not update %s context", typestr); + if (pg_cryptohash_final(ctx, (unsigned char *) VARDATA(result)) < 0) + elog(ERROR, "could not finalize %s context", typestr); + pg_cryptohash_free(ctx); + + SET_VARSIZE(result, digest_len + VARHDRSZ); + + return result; +} + +/* + * SHA-1 + */ +Datum +sha1_bytea(PG_FUNCTION_ARGS) +{ + bytea *result = cryptohash_internal(PG_SHA1, PG_GETARG_BYTEA_PP(0)); + + PG_RETURN_BYTEA_P(result); +} /* * SHA-2 variants @@ -76,28 +148,7 @@ md5_bytea(PG_FUNCTION_ARGS) Datum sha224_bytea(PG_FUNCTION_ARGS) { - bytea *in = PG_GETARG_BYTEA_PP(0); - const uint8 *data; - size_t len; - pg_cryptohash_ctx *ctx; - unsigned char buf[PG_SHA224_DIGEST_LENGTH]; - bytea *result; - - len = VARSIZE_ANY_EXHDR(in); - data = (unsigned char *) VARDATA_ANY(in); - - ctx = pg_cryptohash_create(PG_SHA224); - if (pg_cryptohash_init(ctx) < 0) - elog(ERROR, 
"could not initialize %s context", "SHA224"); - if (pg_cryptohash_update(ctx, data, len) < 0) - elog(ERROR, "could not update %s context", "SHA224"); - if (pg_cryptohash_final(ctx, buf) < 0) - elog(ERROR, "could not finalize %s context", "SHA224"); - pg_cryptohash_free(ctx); - - result = palloc(sizeof(buf) + VARHDRSZ); - SET_VARSIZE(result, sizeof(buf) + VARHDRSZ); - memcpy(VARDATA(result), buf, sizeof(buf)); + bytea *result = cryptohash_internal(PG_SHA224, PG_GETARG_BYTEA_PP(0)); PG_RETURN_BYTEA_P(result); } @@ -105,28 +156,7 @@ sha224_bytea(PG_FUNCTION_ARGS) Datum sha256_bytea(PG_FUNCTION_ARGS) { - bytea *in = PG_GETARG_BYTEA_PP(0); - const uint8 *data; - size_t len; - pg_cryptohash_ctx *ctx; - unsigned char buf[PG_SHA256_DIGEST_LENGTH]; - bytea *result; - - len = VARSIZE_ANY_EXHDR(in); - data = (unsigned char *) VARDATA_ANY(in); - - ctx = pg_cryptohash_create(PG_SHA256); - if (pg_cryptohash_init(ctx) < 0) - elog(ERROR, "could not initialize %s context", "SHA256"); - if (pg_cryptohash_update(ctx, data, len) < 0) - elog(ERROR, "could not update %s context", "SHA256"); - if (pg_cryptohash_final(ctx, buf) < 0) - elog(ERROR, "could not finalize %s context", "SHA256"); - pg_cryptohash_free(ctx); - - result = palloc(sizeof(buf) + VARHDRSZ); - SET_VARSIZE(result, sizeof(buf) + VARHDRSZ); - memcpy(VARDATA(result), buf, sizeof(buf)); + bytea *result = cryptohash_internal(PG_SHA256, PG_GET
Re: Moving other hex functions to /common
The length functions in src/common/hex.c should cast srclen to uint64 prior to the shift. The current hex_enc_len(...) in encode.c performs such a cast.

diff --git a/src/common/hex.c b/src/common/hex.c
index 0123c69697..e87aa1fd7f 100644
--- a/src/common/hex.c
+++ b/src/common/hex.c
@@ -178,7 +178,7 @@ pg_hex_decode(const char *src, size_t srclen, char *dst, size_t dstlen)
 uint64
 pg_hex_enc_len(size_t srclen)
 {
-	return (srclen << 1);
+	return (uint64) srclen << 1;
 }
 
 /*
@@ -192,5 +192,5 @@ pg_hex_enc_len(size_t srclen)
 uint64
 pg_hex_dec_len(size_t srclen)
 {
-	return (srclen >> 1);
+	return (uint64) srclen >> 1;
 }

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
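The overflow being guarded against only bites where size_t is 32 bits wide: there, srclen << 1 wraps before the result is widened to the uint64 return type. A quick simulation of the two expressions in Python (masking to 32 bits to mimic a 32-bit size_t, since Python integers never wrap on their own):

```python
MASK32 = 0xFFFFFFFF  # models a 32-bit size_t

def enc_len_unsafe(srclen):
    # Shift performed in 32-bit arithmetic, then widened to 64 bits:
    # the high bit is lost before the widening ever happens.
    return (srclen << 1) & MASK32

def enc_len_safe(srclen):
    # Widen *before* the shift, as in the patch: the doubled value
    # survives intact.
    return (srclen & MASK32) << 1

big = 0x80000000  # a 2 GiB input length on a 32-bit platform
```

For the pathological input, enc_len_unsafe(big) yields 0 while enc_len_safe(big) yields the correct doubled length.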
Re: 2020-02-13 Press Release Draft
Typo in 9.4 retirement message: s/is it time to retire/it is time to retire/ Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: Internal key management system
On Sat, Feb 1, 2020 at 7:02 PM Masahiko Sawada < masahiko.saw...@2ndquadrant.com> wrote: > On Sun, 2 Feb 2020 at 00:37, Sehrope Sarkuni wrote: > > > > On Fri, Jan 31, 2020 at 1:21 AM Masahiko Sawada > > wrote: > > > On Thu, 30 Jan 2020 at 20:36, Sehrope Sarkuni wrote: > > > > That > > > > would allow the internal usage to have a fixed output length of > > > > LEN(IV) + LEN(HMAC) + LEN(DATA) = 16 + 32 + 64 = 112 bytes. > > > > > > Probably you meant LEN(DATA) is 32? DATA will be an encryption key for > > > AES256 (master key) internally generated. > > > > No it should be 64-bytes. That way we can have separate 32-byte > > encryption key (for AES256) and 32-byte MAC key (for HMAC-SHA256). > > > > While it's common to reuse the same 32-byte key for both AES256 and an > > HMAC-SHA256 and there aren't any known issues with doing so, when > > designing something from scratch it's more secure to use entirely > > separate keys. > > The HMAC key you mentioned above is not the same as the HMAC key > derived from the user provided passphrase, right? That is, individual > key needs to have its IV and HMAC key. Given that the HMAC key used > for HMAC(IV || ENCRYPT(KEY, IV, DATA)) is the latter key (derived from > passphrase), what will be the former key used for? It's not derived from the passphrase, it's unlocked by the passphrase (along with the master encryption key). The server will have 64-bytes of random data, saved encrypted in pg_control, which can be treated as two separate 32-byte keys, let's call them master_encryption_key and master_mac_key. The 64-bytes is unlocked by decrypting it with the user passphrase at startup (which itself would be split into a pair of encryption and MAC keys to do the unlocking). 
The wrap and unwrap operations would use both keys:

wrap(plain_text, encryption_key, mac_key)
{
    // Generate random IV:
    iv = pg_strong_random(16);
    // Encrypt:
    cipher_text = encrypt_aes256_cbc(encryption_key, iv, plain_text);
    // Compute MAC on all inputs:
    mac = hmac_sha256(mac_key, encryption_key || iv || cipher_text);
    // Concat user facing pieces together
    wrapped = mac || iv || cipher_text;
    return wrapped;
}

unwrap(wrapped, encryption_key, mac_key)
{
    // Split wrapped into its pieces:
    actual_mac = wrapped.slice(0, 32);
    iv = wrapped.slice(0 + 32, 16);
    cipher_text = wrapped.slice(0 + 32 + 16);
    // Compute MAC on all inputs:
    expected_mac = hmac_sha256(mac_key, encryption_key || iv || cipher_text);
    // Compare MAC vs value in wrapped:
    if (expected_mac != actual_mac) {
        return Error("MAC does not match");
    }
    // MAC matches so decrypt:
    plain_text = decrypt_aes256_cbc(encryption_key, iv, cipher_text);
    return plain_text;
}

Every input to the encryption operation, including the encryption key, must be included in the HMAC calculation. If you use the same key for both encryption and MAC that's not required, as it's already part of the MAC process as the key. Using separate keys requires explicitly adding the encryption key into the MAC input to ensure that it is the correct key prior to decryption in the unwrap operation. Any additional parts of the wrapped output (ex: a "version" byte for the algos or padding choices) should also be included.

The wrap / unwrap above would be used with the encryption and MAC keys derived from the user passphrase to unlock the master_encryption_key and master_mac_key from pg_control. Then those would be used by the higher level functions:

pg_kmgr_wrap(plain_text)
{
    return wrap(plain_text, master_encryption_key, master_mac_key);
}

pg_kmgr_unwrap(wrapped)
{
    return unwrap(wrapped, master_encryption_key, master_mac_key);
}

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
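The wrap/unwrap pseudocode above can be exercised as a runnable sketch. This version substitutes a stand-in stream cipher (SHA-256 in counter mode) for AES-256-CBC purely to stay stdlib-only; the cipher choice is an assumption of this sketch, not the proposal. The structural points are the same: separate encryption and MAC keys, the MAC covering encryption_key || iv || cipher_text, and a constant-time comparison before any decryption:

```python
import hashlib
import hmac
import os

def _keystream_xor(key, iv, data):
    # Stand-in cipher: SHA-256 in counter mode used as a stream cipher.
    # The proposal itself uses OpenSSL AES-256-CBC; this keeps the
    # sketch stdlib-only while preserving the wrap/unwrap structure.
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        block = hashlib.sha256(key + iv + counter.to_bytes(8, "big")).digest()
        stream.extend(block)
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))

def wrap(plain_text, encryption_key, mac_key):
    iv = os.urandom(16)
    cipher_text = _keystream_xor(encryption_key, iv, plain_text)
    # The MAC covers every input, including the encryption key itself.
    mac = hmac.new(mac_key, encryption_key + iv + cipher_text,
                   hashlib.sha256).digest()
    return mac + iv + cipher_text

def unwrap(wrapped, encryption_key, mac_key):
    actual_mac = wrapped[:32]
    iv = wrapped[32:48]
    cipher_text = wrapped[48:]
    expected_mac = hmac.new(mac_key, encryption_key + iv + cipher_text,
                            hashlib.sha256).digest()
    # Constant-time comparison, checked before any decryption happens.
    if not hmac.compare_digest(expected_mac, actual_mac):
        raise ValueError("MAC does not match")
    return _keystream_xor(encryption_key, iv, cipher_text)
```

The primary contract holds (data == unwrap(wrap(data))), and flipping any byte of the wrapped blob makes unwrap fail before decryption is attempted.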
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Fri, Jan 31, 2020 at 1:21 AM Masahiko Sawada wrote: > On Thu, 30 Jan 2020 at 20:36, Sehrope Sarkuni wrote: > > That > > would allow the internal usage to have a fixed output length of > > LEN(IV) + LEN(HMAC) + LEN(DATA) = 16 + 32 + 64 = 112 bytes. > > Probably you meant LEN(DATA) is 32? DATA will be an encryption key for > AES256 (master key) internally generated. No it should be 64-bytes. That way we can have separate 32-byte encryption key (for AES256) and 32-byte MAC key (for HMAC-SHA256). While it's common to reuse the same 32-byte key for both AES256 and an HMAC-SHA256 and there aren't any known issues with doing so, when designing something from scratch it's more secure to use entirely separate keys. > > For the user facing piece, padding would enabled to support arbitrary > > input data lengths. That would make the output length grow by up to > > 16-bytes (rounding the data length up to the AES block size) plus one > > more byte if a version field is added. > > I think the length of padding also needs to be added to the output. > Anyway, in the first version the same methods of wrapping/unwrapping > key are used for both internal use and user facing function. And user > input key needs to be a multiple of 16 bytes value. A separate length field does not need to be added as the padding-enabled output will already include it at the end[1]. This would be handled automatically by the OpenSSL encryption / decryption operations (if it's enabled): [1]: https://en.wikipedia.org/wiki/Padding_(cryptography)#PKCS#5_and_PKCS#7 > BTW I think this topic is better to be discussed on a separate thread > as the scope no longer includes TDE. I'll start a new thread for > introducing internal KMS. Good idea. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
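To illustrate why no separate length field is needed: PKCS#7 (the padding scheme referenced above, which OpenSSL applies when padding is enabled) stores the pad length in the pad bytes themselves. A minimal stdlib-only sketch of the encode/decode:

```python
def pkcs7_pad(data, block_size=16):
    # Every pad byte holds the pad length; when the input is already
    # block-aligned, a full extra block of padding is appended.
    n = block_size - (len(data) % block_size)
    return data + bytes([n]) * n

def pkcs7_unpad(padded, block_size=16):
    # The last byte says how many bytes to strip -- no separate length
    # field is needed alongside the padded output.
    n = padded[-1]
    if not 1 <= n <= block_size or padded[-n:] != bytes([n]) * n:
        raise ValueError("bad padding")
    return padded[:-n]
```

So a 3-byte input is padded with thirteen 0x0d bytes, and an exact multiple of 16 bytes grows by one whole block, which is where the "round up to the next 16 bytes" output growth mentioned earlier comes from.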
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Wed, Jan 29, 2020 at 3:43 AM Masahiko Sawada wrote: > Thank you for the suggestion. I like your suggestion. We can do an > integrity check of the user input wrapped key by using HMAC when > unwrapping. Regarding the output format you meant to use aes-256 > rather than aes-256-key-wrap? I think that DATA in the output is the > user input key so it still must be multiples of 16-bytes if we use > aes-256-key-wrap. Yes I'm suggesting not using the key wrap functions and instead using the regular EVP_aes_256_cbc with a random IV per invocation. For internal usage (e.g. the encrypted key) it does not need padding as we know the input value would always be a multiple of 16-bytes. That would allow the internal usage to have a fixed output length of LEN(IV) + LEN(HMAC) + LEN(DATA) = 16 + 32 + 64 = 112 bytes. For the user facing piece, padding would enabled to support arbitrary input data lengths. That would make the output length grow by up to 16-bytes (rounding the data length up to the AES block size) plus one more byte if a version field is added. > BTW regarding the implementation of cipher function using opensssl in > the src/common I'm concerned whether we should integrate it with the > openssl.c in pgcrypto. Since pgcrypto with openssl currently supports > aes, des and bf etc the cipher function code in this patch also has > similar functionality. Similarly when we introduced SCRAM we moved > sha2 functions from pgcrypto to src/common. I thought we move all > cipher functions in pgcrypto to src/common but it might be overkill > because the internal KMS will use only aes with only 256 key length as > of now. I'd keep the patch smaller and the functions internal to the KMS for now. Maybe address it again after the patch is complete as it'll be easier to see overlaps that could be combined. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
but it's only 64 bits (vs. say 256 bits for a full HMAC-SHA-256). Rather than use EVP_aes_256_wrap() with its defaults, we can generate a random IV and have the output be "IV || ENCRYPT(KEY, IV, DATA) || HMAC(IV || ENCRYPT(KEY, IV, DATA))". For a fixed-length internal input (ex: the KEK encrypted key stored in pg_control) there's no need for padding as we're dealing with multiples of 16 bytes (ex: KEK encrypted enc-key / mac-key would be 64 bytes). It'd also be useful if the user level wrap/unwrap API allowed for arbitrary sized inputs (not just multiples of 16 bytes). Having the output be in a standard format (i.e. matching OpenSSL's EVP_aes_256_wrap API) is nice, but as it's meant to be an opaque interface I think it's fine if the output is not usable outside the database. I don't see anyone using the wrapped data directly as it's random bytes without the key. The primary contract for the interface: "data == unwrap(wrap(data))". This would require enabling padding, which would round up the size of the output to the next 16 bytes. Adding a prefix byte for a "version" would be nice too as it could be used to infer the specific cipher/mac combo (Ex: v1 would be AES256/HMAC-SHA256). I don't think the added size is an issue as again, the output is opaque. Similar things can also be accomplished by combining the 16-byte only version with pgcrypto but like this it'd be usable out of the box without additional extensions. [1]: https://www.openssl.org/docs/man1.1.1/man3/EVP_aes_256_wrap.html [2]: https://tools.ietf.org/html/rfc5649 Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/ diff --git a/configure b/configure index 702adba839..2daa5bf0c2 100755 --- a/configure +++ b/configure @@ -12113,7 +12113,7 @@ done # defines OPENSSL_VERSION_NUMBER to claim version 2.0.0, even though it # doesn't have these OpenSSL 1.1.0 functions. So check for individual # functions.
- for ac_func in OPENSSL_init_ssl BIO_get_data BIO_meth_new ASN1_STRING_get0_data + for ac_func in OPENSSL_init_ssl OPENSSL_init_crypto BIO_get_data BIO_meth_new ASN1_STRING_get0_data do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" diff --git a/configure.in b/configure.in index 8165f70039..79329d9f15 100644 --- a/configure.in +++ b/configure.in @@ -1194,7 +1194,7 @@ if test "$with_openssl" = yes ; then # defines OPENSSL_VERSION_NUMBER to claim version 2.0.0, even though it # doesn't have these OpenSSL 1.1.0 functions. So check for individual # functions. - AC_CHECK_FUNCS([OPENSSL_init_ssl BIO_get_data BIO_meth_new ASN1_STRING_get0_data]) + AC_CHECK_FUNCS([OPENSSL_init_ssl OPENSSL_init_crypto BIO_get_data BIO_meth_new ASN1_STRING_get0_data]) # OpenSSL versions before 1.1.0 required setting callback functions, for # thread-safety. In 1.1.0, it's no longer required, and CRYPTO_lock() # function was removed. diff --git a/src/backend/Makefile b/src/backend/Makefile index 9706a95848..4ace302038 100644 --- a/src/backend/Makefile +++ b/src/backend/Makefile @@ -21,7 +21,7 @@ SUBDIRS = access bootstrap catalog parser commands executor foreign lib libpq \ main nodes optimizer partitioning port postmaster \ regex replication rewrite \ statistics storage tcop tsearch utils $(top_builddir)/src/timezone \ - jit + jit crypto include $(srcdir)/common.mk diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 7f4f784c0e..5293137225 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -41,6 +41,7 @@ #include "catalog/pg_database.h" #include "commands/tablespace.h" #include "common/controldata_utils.h" +#include "crypto/kmgr.h" #include "miscadmin.h" #include "pg_trace.h" #include "pgstat.h" @@ -77,6 +78,7 @@ #include "utils/timestamp.h" extern uint32 bootstrap_data_checksum_version; +extern uint32 bootstrap_data_encryption_cipher; /* Unsupported old 
recovery command file names (relative to $PGDATA) */ #define RECOVERY_COMMAND_FILE "recovery.conf" @@ -4779,6 +4781,10 @@ ReadControlFile(void) /* Make the initdb settings visible as GUC variables, too */ SetConfigOption("data_checksums", DataChecksumsEnabled() ? "yes" : "no", PGC_INTERNAL, PGC_S_OVERRIDE); + + SetConfigOption("data_encryption_cipher", + kmgr_cipher_string(GetDataEncryptionCipher()), + PGC_INTERNAL, PGC_S_OVERRIDE); } /* @@ -4811,6 +4817,13 @@ GetMockAuthenticationNonce(void) return ControlFile->mock_authentication_nonce; } +WrappedEncKeyWithHmac * +GetMasterEncryptionKey(void) +{ + Assert(ControlFile != NULL); + return &(ControlFile->master_dek); +} + /* * Are checksums enabled for data pages? */ @@ -4821,6 +4834,13 @@ DataChecksumsEn
Re: ssl passphrase callback
On Wed, Nov 13, 2019 at 3:23 PM Tomas Vondra wrote:
> I think it would be beneficial to explain why shared object is more
> secure than an OS command. Perhaps it's common knowledge, but it's not
> quite obvious to me.

External command args can be viewed by other OS users (not just the postgres user). For non-sensitive arguments (ex: WAL filename?) that's not an issue, but if you plan on passing a potentially secret value from the backend to the external OS command, that value would be exposed:

Terminal 1 (run a command as some other user):

$ sudo -u nobody sleep 5

Terminal 2 (view command args as a different non-super user):

$ ps -u nobody -o args
COMMAND
sleep 5

A shared library would not have this problem as the backend directly executes the library in the same process.

Has the idea of using environment variables (rather than command line args) for external commands been brought up before? I couldn't find anything in the mailing list archives. Environment variables have the advantage of only being readable by the process owner and super user. They also naturally have a "name" and do not have escaping or quoting issues. For example, archive_command %p could be exposed as PG_ARG_ARCHIVE_PATH and %f could be exposed as PG_ARG_ARCHIVE_FILENAME. Then you could have a script use them via:

#!/usr/bin/env bash
set -euo pipefail

main () {
    local archive_dir="/path/to/archive_dir/"
    local archive_file="${archive_dir}${PG_ARG_ARCHIVE_FILENAME}"
    test ! -f "${archive_file}" && cp -- "${PG_ARG_ARCHIVE_PATH}" "${archive_file}"
}

main "$@"

It's not particularly useful for that basic archive case but if there's something like PG_ARG_SUPER_SECRET then the executed command could receive that value without it being exposed. That'd be useful for something like a callout to an external KMS (key management system). Nothing stops something like this from being used in tandem with string substitution to create the full commands. That'd give backwards compatibility too.
The main limitation compared to a shared library is that you'd still have to explicitly pick and name the exposed argument environment variables (i.e. like picking the set of %? substitutions). If there's a generic shared-library-for-external-commands approach then there could be a built-in library that provides this type of "expose as env vars" functionality. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: 2019-11-14 Press Release Draft
> * Several fixes for logical replication, including a failure when the > publisher > & subscriber had different REPLICA IDENTITY columns set. "&" should probably be "and" as I don't see it used like that in any other release notes. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: PostgreSQL 12 RC1 Press Release Draft
The "Upgrading to PostgreSQL 12 RC 1" references v11 rather than v12: "To upgrade to PostgreSQL 11 RC 1 from Beta 4 or an earlier version of PostgreSQL 11, ..." Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Sat, Aug 17, 2019 at 12:43 PM Ibrar Ahmed wrote: > +1 for voice call, bruce we usually have a weekly TDE call. > Please add me to the call as well. Thanks! Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Thu, Aug 8, 2019 at 6:31 PM Stephen Frost wrote: > Strictly speaking, that isn't actually crash recovery, it's physical > replication / HA, and while those are certainly nice to have it's no > guarantee that they're required or that you'd want to have the same keys > for them- conceptually, at least, you could have WAL with one key that > both sides know and then different keys for the actual data files, if we > go with the approach where the WAL is encrypted with one key and then > otherwise is plaintext. > I like the idea of separating the WAL key from the rest of the data files. It'd all be unlocked by the MDEK and you'd still need derived keys per WAL file, but disconnecting all that from the data files solves a lot of the problems with promoted replicas. This would complicate cloning a replica as using a different MDEK would involve decrypting / encrypting everything rather than just copying the files. Even if that's not baked into a first version, the separation allows for eventually supporting that. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Thu, Aug 8, 2019 at 2:16 PM Bruce Momjian wrote: > On Wed, Aug 7, 2019 at 08:56:18AM -0400, Sehrope Sarkuni wrote: > > Simplest approach for derived keys would be to use immutable attributes > of the > > WAL files as an input to the key derivation. Something like HKDF(MDEK, > "WAL:" | > > So, I am thinking we should use "WAL:" for WAL and "REL:" for heap/index > files. > Sounds good. Any unique convention is fine. Main thing to keep in mind is that they're directly tied to the master key so it's not possible to rotate them without changing the master key. This is in contrast to saving a WDEK key to a file (similar to how the MDEK key would be saved) and unlocking it with the MDEK. That has more moving parts but would allow that key to be independent of the MDEK. In a later message Stephen refers to an example of a replica receiving encrypted WAL and applying it with a different MDEK for the page buffers. That's doable with an independent WDEK. > > | timeline_id || wal_segment_num) should be fine for this as it is: > > I considered using the timeline in the nonce, but then remembered that > in timeline switch, we _copy_ the part of the old WAL up to the timeline > switch to the new timeline; see: > > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/transam/xlog.c;h=f55352385732c6b0124eff5265462f3883fe7435;hb=HEAD#l5502 > >* Initialize the starting WAL segment for the new timeline. If the > switch >* happens in the middle of a segment, copy data from the last WAL > segment >* of the old timeline up to the switch point, to the starting WAL > segment >* on the new timeline. > > We would need to decrypt/encrypt to do the copy, and just wasn't sure of > the value of the timeline in the nonce. 
One value to it is that if > there some WAL that generated after the timeline switch in the old > primary that isn't transfered, there would be potentially new data > encrypted with the same key/nonce in the new primary, but if that WAL is > not used, odds are it is gone/destroyed/inaccessible, or it would have > been used during the switchover, so it didn't seem worth worrying about. > > One _big_ reason to add the timeline is if you had a standby that you > recovered and rolled forward only to a specific transaction, then > continued running it as a new primary. In that case, you would have > different WAL encrypted with the same key/nonce, but that sounds like > the same as promoting two standbys, and we should just document not to > do it. > > Maybe we need to consider this further. > Good points. Yes, anything short of generating a new key at promotion time will have these issues. If we're not going to do that, no point in adding the timeline id if it does not change anything. I had thought only the combo was unique but sounds like the segment number is unique on its own. One thing I like about a unique per-file key is that it simplifies the IV generation (i.e. can start at zero). What about discarding the rest of the WAL file at promotion and skipping to a new file? A random per-file key stored in the first page header would ensure that going forward all WAL data is encrypted differently. Combine that with independent WAL and MDEK keys and everything would be different between two replicas promoted from the same point. > > A unique WDEK per WAL file that is derived from the segment number would > not > > have that problem. A unique key per-file means the IVs can all start at > zero > > and the each file can be treated as one encrypted stream. Any encryption/ > > decryption code would only need to touch the write/read callsites. > > So, I am now wondering when we should be using a non-zero nonce to > start, and when we should be using derived keys. 
Should we add the > page-number to the derived key for heap/index files too and just use the > LSN for nonce, or add the LSN to the derived key too? > The main cost of using multiple keys is that you need to derive or unlock them for each usage. A per-type, per-relation, or per-file derived key with the same non-repeating guarantees for the IV (ex: LSN + Page Number) is as secure but allows for caching all needed derived keys in memory (it's one per open file descriptor). Having page-level derived keys incorporating the LSN + Page Number and starting the per-page IV at zero works, but you'd have to perform an HKDF for each page read or write. A cache of those derived keys would be much larger (32 bytes per page) so presumably you're not going to have them all cached, or maybe not bother with any caching. > > Even without a per-page MAC, a MAC at some level for WAL has its own > benefits > > such as perfect corruption detection. It could be per-record, > per-N-records, &g
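The caching trade-off above can be made concrete with a toy sketch: a per-file key is derived once and memoized, so the KDF cost is paid per open file rather than per page. The derivation below is a simple SHA-256 stand-in for a real HKDF, and the `REL:` labeling is illustrative only:

```python
import hashlib
from functools import lru_cache

MDEK = b"\x00" * 32  # placeholder master key, not a real secret

@lru_cache(maxsize=None)
def file_key(relfilenode: int, fork: int) -> bytes:
    # Stand-in for HKDF(MDEK, info). One cached entry per open file,
    # so the derivation cost is not paid on every page read/write.
    info = b"REL:%d:%d" % (relfilenode, fork)
    return hashlib.sha256(MDEK + info).digest()

k1 = file_key(16384, 0)
k2 = file_key(16384, 0)   # cache hit, no recomputation
k3 = file_key(16384, 1)   # different fork -> different derived key
```

A per-page variant of the same idea would instead call the KDF with (LSN, page number) on every I/O, which is the cost the message above is weighing.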
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Wed, Aug 7, 2019 at 1:39 PM Bruce Momjian wrote: > On Wed, Aug 7, 2019 at 11:41:51AM -0400, Sehrope Sarkuni wrote: > > On Wed, Aug 7, 2019 at 7:19 AM Bruce Momjian wrote: > > > > On Wed, Aug 7, 2019 at 05:13:31PM +0900, Masahiko Sawada wrote: > > > I understood. IIUC in your approach postgres processes encrypt WAL > > > records when inserting to the WAL buffer. So WAL data is encrypted > > > even on the WAL buffer. > > > > > > I was originally thinking of not encrypting the shared WAL buffers but > that may > > have issues. If the buffers are already encrypted and contiguous in > shared > > memory, it's possible to write out many via a single pg_pwrite(...) call > as is > > currently done in XLogWrite(...). > > The shared buffers will not be encrypted --- they are encrypted only > when being written to storage. We felt encrypting shared buffers will > be too much overhead, for little gain. I don't know if we will encrypt > while writing to the WAL buffers or while writing the WAL buffers to > the file system. > My mistake on the wording. By "shared WAL buffers" I meant the shared memory used for WAL buffers, XLogCtl->pages. Not the shared buffers for pages. > > If they're not encrypted you'd need to do more work in that critical > section. > > That'd involve allocating a commensurate amount of memory to hold the > encrypted > > pages and then encrypting them all prior to the single pg_pwrite(...) > call. > > Reusing one buffer is possible but it would require encrypting and > writing the > > pages one by one. Both of those seem like a bad idea. > > Well, right now the 8k pages is part of the WAL stream, so I don't know > it would be any more overhead than other WAL writes. The total work is the same, but when it happens, how much memory is used, and how many syscalls are made could all change. Right now the XLogWrite(...) 
code can write many WAL pages at once via a single call to pg_pwrite(...): https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/transam/xlog.c;h=f55352385732c6b0124eff5265462f3883fe7435;hb=HEAD#l2491 If the blocks are not encrypted then you either need to allocate and encrypt everything (could be up to wal_buffers max size) to do it as one write, or encrypt chunks of WAL and do multiple writes. I'm not sure how big an issue this would be in practice as it'd be workload specific. > I am hoping we can > generate the encryption bit stream in chunks earlier so we can just to > the XOR was we are writing the data to the WAL buffers. > For pure CTR that sounds doable as it'd be the same as doing an XOR with encrypted zero. Anything with a built-in MAC like GCM would not work though (I'm not proposing we use that, just keeping it in mind). You'd also increase your memory requirements (one allocation for the encrypted stream and one for the encrypted data right?). > > Better to pay the encryption cost at the time of WAL record creation and > keep > > the writing process as fast and simple as possible. > > Yes, I don't think we know at the time of WAL record creation what > _offset_ the records will have when then are written to WAL, so I am > thinking we need to do it later, and as I said, I am hoping we can > generate the encryption bit stream earlier. > > > > It works but I think the implementation might be complex; For > example > > > using openssl, we would use EVP functions to encrypt data by > > > AES-256-CTR. We would need to make IV and pass it to them and these > > > functions however don't manage the counter value of nonce as long > as I > > > didn't miss. That is, we need to calculate the correct counter > value > > > for each encryption and pass it to EVP functions. Suppose we > encrypt > > > 20 bytes of WAL. 
The first 16 bytes is encrypted with nonce of > > > (segment_number, 0) and the next 4 bytes is encrypted with nonce of > > > (segment_number, 1). After that suppose we encrypt 12 bytes of > WAL. We > > > cannot use nonce of (segment_number, 2) but should use nonce of > > > (segment_number , 1). Therefore we would need 4 bytes padding and > to > > > encrypt it and then to throw that 4 bytes away . > > > > Since we want to have per-byte control over encryption, for both > > heap/index pages (skip LSN and CRC), and WAL (encrypt to the last > byte), > > I assumed we would need to generate a bit stream of a specified size > and > > do the XOR ourselves against the data. I assume ssh does this, so we > > would have to study the method. > > > > > > The lower level non-EVP
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Wed, Aug 7, 2019 at 7:19 AM Bruce Momjian wrote: > On Wed, Aug 7, 2019 at 05:13:31PM +0900, Masahiko Sawada wrote: > > I understood. IIUC in your approach postgres processes encrypt WAL > > records when inserting to the WAL buffer. So WAL data is encrypted > > even on the WAL buffer. > I was originally thinking of not encrypting the shared WAL buffers but that may have issues. If the buffers are already encrypted and contiguous in shared memory, it's possible to write out many via a single pg_pwrite(...) call as is currently done in XLogWrite(...). If they're not encrypted you'd need to do more work in that critical section. That'd involve allocating a commensurate amount of memory to hold the encrypted pages and then encrypting them all prior to the single pg_pwrite(...) call. Reusing one buffer is possible but it would require encrypting and writing the pages one by one. Both of those seem like a bad idea. Better to pay the encryption cost at the time of WAL record creation and keep the writing process as fast and simple as possible. > > It works but I think the implementation might be complex; For example > > using openssl, we would use EVP functions to encrypt data by > > AES-256-CTR. We would need to make IV and pass it to them and these > > functions however don't manage the counter value of nonce as long as I > > didn't miss. That is, we need to calculate the correct counter value > > for each encryption and pass it to EVP functions. Suppose we encrypt > > 20 bytes of WAL. The first 16 bytes is encrypted with nonce of > > (segment_number, 0) and the next 4 bytes is encrypted with nonce of > > (segment_number, 1). After that suppose we encrypt 12 bytes of WAL. We > > cannot use nonce of (segment_number, 2) but should use nonce of > > (segment_number , 1). Therefore we would need 4 bytes padding and to > > encrypt it and then to throw that 4 bytes away . 
> > Since we want to have per-byte control over encryption, for both > heap/index pages (skip LSN and CRC), and WAL (encrypt to the last byte), > I assumed we would need to generate a bit stream of a specified size and > do the XOR ourselves against the data. I assume ssh does this, so we > would have to study the method. > The lower level non-EVP OpenSSL functions allow specifying the offset within the 16-byte AES block from which the encrypt/decrypt should proceed. It's the "num" parameter of their encrypt/decrypt functions. For a continuous encrypted stream such as a WAL file, a "pread(...)" of a possibly non-16-byte aligned section would involve determining the 16-byte counter (byte_offset / 16) and the intra-block offset (byte_offset % 16). I'm not sure how one handles initializing the internal encrypted counter and that might be one more step that would need to be done. But it's definitely possible to read / write less than a block via those APIs (not the EVP ones). I don't think the EVP functions have parameters for the intra-block offset but you can mimic it by initializing the IV/block counter and then skipping over the intra-block offset by either reading or writing a dummy partial block. The EVP read and write functions both deal with individual bytes so once you've seeked to your desired offset you can read or write the real individual bytes. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
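The seek arithmetic described above can be sketched in a self-contained way. This is a deliberate toy: the "block cipher" is faked with SHA-256 purely so the sketch runs without OpenSSL; real code would use AES via the lower-level OpenSSL CTR functions and their "num" intra-block offset parameter:

```python
import hashlib

BLOCK = 16  # AES block size in bytes

def keystream_block(key: bytes, counter: int) -> bytes:
    # Toy PRF standing in for AES-encrypt(key, IV || counter).
    return hashlib.sha256(key + counter.to_bytes(16, "big")).digest()[:BLOCK]

def ctr_xor_at(key: bytes, byte_offset: int, data: bytes) -> bytes:
    # Start mid-stream: derive the 16-byte block counter and the
    # intra-block offset, exactly as a non-aligned pread would.
    counter = byte_offset // BLOCK
    intra = byte_offset % BLOCK
    out = bytearray()
    ks = keystream_block(key, counter)[intra:]
    for b in data:
        if not ks:
            counter += 1
            ks = keystream_block(key, counter)
        out.append(b ^ ks[0])
        ks = ks[1:]
    return bytes(out)

key = b"k" * 32
stream = ctr_xor_at(key, 0, b"hello, unaligned ctr seek!")
# A slice can be decrypted knowing only its byte offset in the file.
plain_slice = ctr_xor_at(key, 7, stream[7:13])
```

Because CTR is an XOR against a position-determined keystream, encrypt and decrypt are the same operation, and any byte range can be processed independently once the (block counter, intra-block offset) pair is computed.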
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Aug 5, 2019 at 9:02 PM Bruce Momjian wrote: > On Wed, Jul 31, 2019 at 09:25:01AM -0400, Sehrope Sarkuni wrote: > > Even if we do not include a separate per-relation salt or things like > > relfilenode when generating a derived key, we can still include other > types of > > immutable attributes. For example the fork type could be included to > eventually > > allow multiple forks for the same relation to be encrypted with the same > IV = > > LSN + Page Number as the derived key per-fork would be distinct. > > Yes, the fork number could be useful in this case. I was thinking we > would just leave the extra bits as zeros and we can then set it to '1' > or something else for a different fork. > Key derivation has more flexibility as you're not limited by the number of unused bits in the IV. > > WAL encryption should not use the same key as page encryption so there's > no > > need to design the IV to try to avoid matching the page IVs. Even a basic > > derivation with a single fixed WDEK = HKDF(MDEK, "WAL") and TDEK = > HKDF(MDEK, > > "PAGE") would ensure separate keys. That's the the literal string "WAL" > or > > "PAGE" being added as a salt to generate the respective keys, all that > matters > > is they're different. > > I was thinking the WAL would use the same key since the nonce is unique > between the two. What value is there in using a different key? > Never having to worry about overlap in Key + IV usage is the main advantage. While it's possible to structure IVs to prevent that from happening, it's much easier to completely avoid that situation by ensuring different parts of an application are using separate derived keys. > > Ideally WAL encryption would generating new derived keys as part of the > WAL > > stream. The WAL stream is not fixed so you have the luxury of being able > to add > > a "Use new random salt XZY going forward" records. 
Forcing generation of > a new > > salt/key upon promotion of a replica would ensure that at least the WAL > is > > unique going forward. Could also generate a new upon server startup, > after > > Ah, yes, good point, and using a derived key would make that easier. > The tricky part is what to use to create the new derived key, unless we > generate a random number and store that somewhere in the data directory, > but that might lead to fragility, so I am worried. The simplest approach for derived keys would be to use immutable attributes of the WAL files as an input to the key derivation. Something like HKDF(MDEK, "WAL:" || timeline_id || wal_segment_num) should be fine for this as it is:

* Unique per WAL file
* Known prior to writing to a given WAL file
* Known prior to reading a given WAL file
* Does not require any additional persistence

We have pg_rewind, > which allows to make the WAL go backwards. What is the value in doing > this? > Good point re: pg_rewind. Having key rotation records in the stream would complicate that as you'd have to jump back / forward to figure out which key to use. It's doable but much more complicated. A unique WDEK per WAL file that is derived from the segment number would not have that problem. A unique key per-file means the IVs can all start at zero and each file can be treated as one encrypted stream. Any encryption/decryption code would only need to touch the write/read callsites. > > every N bytes, or a new one for each new WAL file. There's much more > > flexibility compared to page encryption. > > > > As WAL is a single continuous stream, we can start the IV for each > derived WAL > > key from zero. There's no need to complicate it further as Key + IV will > never > > be reused. > > Uh, you want a new random key for each WAL file? I was going to use the > WAL segment number as the nonce, which is always increasing, and easily > determined. The file is 16MB. 
> Ideally yes as it would allow for multiple replicas promoted off the same primary to immediately diverge as each would have its own keys. I don't consider it a requirement but if it's possible without significant added complexity I say that's a win. I'm still reading up on the file and record format to understand how complex that would be. Though given your point re: pg_rewind and the lack of handling for page encryption divergence when promoting multiple replicas, I doubt the complexity will be worth it. > > If WAL is always written as full pages we need to ensure that the empty > parts > > of the page are actual zeros and not "encrypted zeroes". Otherwise an > XOR of > > the empty section of the first write of a page against a subsequent one > would > > give you the plain text. > > Right, I think w
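For reference, the proposed per-segment derivation HKDF(MDEK, "WAL:" || timeline_id || wal_segment_num) can be sketched with an RFC 5869 HKDF built from the stdlib hmac module. The info encoding below is illustrative, not a proposed on-disk format:

```python
import hashlib
import hmac

def hkdf_sha256(ikm: bytes, info: bytes, length: int = 32,
                salt: bytes = b"\x00" * 32) -> bytes:
    # RFC 5869 HKDF: extract a PRK from the input key material,
    # then expand it with the context-specific info.
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    okm, t, counter = b"", b"", 1
    while len(okm) < length:
        t = hmac.new(prk, t + info + bytes([counter]), hashlib.sha256).digest()
        okm += t
        counter += 1
    return okm[:length]

def wal_segment_key(mdek: bytes, timeline_id: int, segment_num: int) -> bytes:
    # Unique per WAL file, computable before reading or writing it,
    # and requiring no additional persisted state.
    info = b"WAL:%d:%d" % (timeline_id, segment_num)
    return hkdf_sha256(mdek, info)

mdek = b"\x01" * 32  # placeholder master key
k_a = wal_segment_key(mdek, 1, 42)
k_b = wal_segment_key(mdek, 1, 43)
```

Since the key is unique per segment, the CTR IV within each file can safely start at zero, as discussed above.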
Re: Fix typos
On Fri, Aug 2, 2019 at 12:11 AM Michael Paquier wrote: > On Thu, Aug 01, 2019 at 11:01:59PM -0400, Alvaro Herrera wrote: > > I think slight variations don't really detract from the value of the > > product, and consider the odd variation a reminder of the diversity of > > the project. I don't suggest that we purposefully introduce spelling > > variations, or that we refrain from fixing ones that appear in code > > we're changing, but I don't see the point in changing a line for the > > sole reason of standardising the spelling of a word. > > Agreed. This always reminds me of ANALYZE vs. ANALYSE where we don't > actually document the latter :) > I didn't know about that. That's a fun one! > > > That said, I'm not a native English speaker. > > Neither am I. > I am. Consistency is nice but either reads fine to me. Only brought it up as I didn't see many other usages so seemed out of place. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: Fix typos
On Thu, Aug 1, 2019 at 10:18 PM Tom Lane wrote: > It's British vs. American spelling. For the most part, Postgres > follows American spelling, but there's the odd Briticism here and > there. Thanks for the explanation. I thought that might be the case but didn't find any other usages of "serialise" so was not sure. > I'm not sure whether it's worth trying to standardize. > I think the most recent opinion on this was Munro's: > > > https://www.postgresql.org/message-id/ca+hukgjz-pdmgwxroiwvn-aeg4-ajdwj3gwdqkosa8g65sp...@mail.gmail.com Either reads fine to me and the best rationale I can think of for going with one spelling is to not have the same "fix" come up again. If there is a desire to change this, attached is updated to include one more instance of "materialise" and a change to the commit message to match some similar ones I found in the past. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/ From 20c018b9350cb17e87930cebb66b715bc4b9fcf4 Mon Sep 17 00:00:00 2001 From: Sehrope Sarkuni Date: Fri, 2 Aug 2019 05:58:35 -0400 Subject: [PATCH] Use American spelling for "serialize" and "materalize" --- contrib/pgstattuple/pgstatapprox.c | 2 +- contrib/xml2/xpath.c| 2 +- src/backend/access/transam/README | 4 ++-- src/backend/storage/buffer/bufmgr.c | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c index 636c8d40ac..83b3270554 100644 --- a/contrib/pgstattuple/pgstatapprox.c +++ b/contrib/pgstattuple/pgstatapprox.c @@ -278,7 +278,7 @@ pgstattuple_approx_internal(Oid relid, FunctionCallInfo fcinfo) errmsg("cannot access temporary tables of other sessions"))); /* - * We support only ordinary relations and materialised views, because we + * We support only ordinary relations and materialized views, because we * depend on the visibility map and free space map for our estimates about * unscanned pages. 
*/ diff --git a/contrib/xml2/xpath.c b/contrib/xml2/xpath.c index 1e5b71d9a0..11749b6c47 100644 --- a/contrib/xml2/xpath.c +++ b/contrib/xml2/xpath.c @@ -573,7 +573,7 @@ xpath_table(PG_FUNCTION_ARGS) errmsg("xpath_table must be called as a table function"))); /* - * We want to materialise because it means that we don't have to carry + * We want to materialize because it means that we don't have to carry * libxml2 parser state between invocations of this function */ if (!(rsinfo->allowedModes & SFRM_Materialize)) diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index ad4083eb6b..49b6809c82 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -609,8 +609,8 @@ or more buffer locks be held concurrently, you must lock the pages in appropriate order, and not release the locks until all the changes are done. Note that we must only use PageSetLSN/PageGetLSN() when we know the action -is serialised. Only Startup process may modify data blocks during recovery, -so Startup process may execute PageGetLSN() without fear of serialisation +is serialized. Only Startup process may modify data blocks during recovery, +so Startup process may execute PageGetLSN() without fear of serialization problems. All other processes must only call PageSet/GetLSN when holding either an exclusive buffer lock or a shared lock plus buffer header lock, or be writing the data block directly rather than through shared buffers diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 6f3a402854..3d9e0a3488 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -3519,7 +3519,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std) /* * Set the page LSN if we wrote a backup block. We aren't supposed * to set this when only holding a share lock but as long as we - * serialise it somehow we're OK. We choose to set LSN while + * serialize it somehow we're OK. 
We choose to set LSN while * holding the buffer header lock, which causes any reader of an * LSN who holds only a share lock to also obtain a buffer header * lock before using PageGetLSN(), which is enforced in -- 2.17.1
Fix typos
Hi, Attached fixes some typos for "serialise" => "serialize" and "materialise" => "materialize". Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/ From 30d34d082120ac2c041a4ad45e9d6e31b0ea9c9d Mon Sep 17 00:00:00 2001 From: Sehrope Sarkuni Date: Thu, 1 Aug 2019 08:00:23 -0400 Subject: [PATCH] Fix typos --- contrib/xml2/xpath.c| 2 +- src/backend/access/transam/README | 4 ++-- src/backend/storage/buffer/bufmgr.c | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/contrib/xml2/xpath.c b/contrib/xml2/xpath.c index 1e5b71d9a0..11749b6c47 100644 --- a/contrib/xml2/xpath.c +++ b/contrib/xml2/xpath.c @@ -573,7 +573,7 @@ xpath_table(PG_FUNCTION_ARGS) errmsg("xpath_table must be called as a table function"))); /* - * We want to materialise because it means that we don't have to carry + * We want to materialize because it means that we don't have to carry * libxml2 parser state between invocations of this function */ if (!(rsinfo->allowedModes & SFRM_Materialize)) diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index ad4083eb6b..49b6809c82 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -609,8 +609,8 @@ or more buffer locks be held concurrently, you must lock the pages in appropriate order, and not release the locks until all the changes are done. Note that we must only use PageSetLSN/PageGetLSN() when we know the action -is serialised. Only Startup process may modify data blocks during recovery, -so Startup process may execute PageGetLSN() without fear of serialisation +is serialized. Only Startup process may modify data blocks during recovery, +so Startup process may execute PageGetLSN() without fear of serialization problems. 
All other processes must only call PageSet/GetLSN when holding either an exclusive buffer lock or a shared lock plus buffer header lock, or be writing the data block directly rather than through shared buffers diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 6f3a402854..3d9e0a3488 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -3519,7 +3519,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std) /* * Set the page LSN if we wrote a backup block. We aren't supposed * to set this when only holding a share lock but as long as we - * serialise it somehow we're OK. We choose to set LSN while + * serialize it somehow we're OK. We choose to set LSN while * holding the buffer header lock, which causes any reader of an * LSN who holds only a share lock to also obtain a buffer header * lock before using PageGetLSN(), which is enforced in -- 2.17.1
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Wed, Jul 31, 2019 at 2:32 AM Masahiko Sawada wrote: > Just to confirm, we have 21 bits left for nonce in CTR? We have LSN (8 > bytes), page-number (4 bytes) and counter (11 bits) in 16 bytes nonce > space. Even though we have 21 bits left we cannot store relfilenode to > the IV. > Fields like relfilenode, database, or tablespace could be added to the derived key, not the per-page IV. There are no space limitations as they are additional inputs into the HKDF (key derivation function). Additional salt of any size, either some stored random value or something deterministic like the relfilenode/database/tablespace, can be added to the HKDF. There are separate issues with including those specific fields, but it's not a size limitation. > For WAL encryption, before flushing WAL we encrypt whole 8k WAL page > and then write only the encrypted data of the new WAL record using > pg_pwrite() rather than write whole encrypted page. So each time we > encrypt 8k WAL page we end up with encrypting different data with the > same key+nonce but since we don't write to the disk other than space > where we actually wrote WAL records it's not a problem. Is that right? > Ah, this is what I was referring to in my previous mail. I'm not familiar with how the writes happen yet (reading up...) but, yes, we would need to ensure that encrypted data is not written more than once (i.e. no writing of encrypt(zero) followed by writing of encrypt(non-zero) at the same spot). Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Tue, Jul 30, 2019 at 4:48 PM Bruce Momjian wrote: > I had more time to think about the complexity of adding relfilenode to > the IV. Since relfilenode is only unique within a database/tablespace, > we would need to have pg_upgrade preserve database/tablespace oids > (which I assume are the same as the directory and tablespace symlinks). > Then, to decode a page, you would need to look up those values. This is > in addition to the new complexity of CREATE DATABASE and moving files > between tablespaces. I am also concerned that crash recovery operations > and cluster forensics and repair would need to also deal with this. > > I am not even clear if pg_upgrade preserving relfilenode is possible --- > when we wrap the relfilenode counter, does it start at 1 or at the > first-user-relation-oid? If the former, it could conflict with oids > assigned to new system tables in later major releases. Tying the > preservation of relations to two restrictions seems risky. > Agreed. Unless you know for sure the input is going to be immutable across copies or upgrades, including anything in either the IV or key derivation gets risky and could tie you down for the future. That's partly why I like the idea of a separate salt (basically you directly pay for the complexity by tracking that). Even if we do not include a separate per-relation salt or things like relfilenode when generating a derived key, we can still include other types of immutable attributes. For example the fork type could be included to eventually allow multiple forks for the same relation to be encrypted with the same IV = LSN + Page Number as the derived key per-fork would be distinct. > Using just the page LSN and page number allows a page to be be > decrypted/encrypted independently of its file name, tablespace, and > database, and I think that is a win for simplicity. Of course, if it is > insecure we will not do it. 
> As the LSN + Page Number combo is unique across all relations (not just one relation), I think we're good for pages. I am thinking for the heap/index IV, it would be: > > uint64 lsn; > uint32 page number; > /* only uses 11 bits for a zero-based CTR counter for 32k pages */ > uint32 counter; > Looks good. > and for WAL it would be: > > uint64 segment_number; > uint32 counter; > /* guarantees this IV doesn't match any relation IV */ > uint32 2^32-1 /* all 1's */ > I need to read up more on the structure of the WAL records but here are some high-level thoughts: WAL encryption should not use the same key as page encryption so there's no need to design the IV to try to avoid matching the page IVs. Even a basic derivation with a single fixed WDEK = HKDF(MDEK, "WAL") and TDEK = HKDF(MDEK, "PAGE") would ensure separate keys. That's the literal string "WAL" or "PAGE" being added as a salt to generate the respective keys; all that matters is they're different. Ideally WAL encryption would generate new derived keys as part of the WAL stream. The WAL stream is not fixed so you have the luxury of being able to add "use new random salt XYZ going forward" records. Forcing generation of a new salt/key upon promotion of a replica would ensure that at least the WAL is unique going forward. Could also generate a new one upon server startup, after every N bytes, or a new one for each new WAL file. There's much more flexibility compared to page encryption. As WAL is a single continuous stream, we can start the IV for each derived WAL key from zero. There's no need to complicate it further as Key + IV will never be reused. If WAL is always written as full pages we need to ensure that the empty parts of the page are actual zeros and not "encrypted zeroes". Otherwise an XOR of the empty section of the first write of a page against a subsequent one would give you the plain text. 
The non-fixed size of the WAL allows for the addition of a MAC though I'm not sure yet the best way to incorporate it. It could be part of each encrypted record or its own summary record (providing a MAC for a series of WAL records). After I've gone through this a bit more I'm looking to put together a write up with this and some other thoughts in one place. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Tue, Jul 30, 2019 at 10:06 AM Masahiko Sawada wrote: > On Tue, Jul 30, 2019 at 5:03 AM Sehrope Sarkuni > wrote: > > > > On Mon, Jul 29, 2019 at 9:44 AM Bruce Momjian wrote: > >> > >> > Checking that all buffers using a single LSN are from the same > >> > relation would be a good idea but I think it's hard to test it and > >> > regard the test result as okay. Even if we passed 'make checkworld', > >> > it might still be possible to happen. And even assertion failures > >> > >> Yes, the problem is that if you embed the relfilenode or tablespace or > >> database in the encryption IV, you then need to then make sure you > >> re-encrypt any files that move between these. I am hesitant to do that > >> since it then requires these workarounds for encryption going forward. > >> We know that most people will not be using encryption, so that will not > >> be well tested either. For pg_upgrade, I used a minimal-impact > >> approach, and it has allowed dramatic changes in our code without > >> requiring changes and retesting of pg_upgrade. > > > > > > Will there be a per-relation salt stored in a separate file? I saw it > mentioned in a few places (most recently > https://www.postgresql.org/message-id/aa386c3f-fb89-60af-c7a3-9263a633ca1a%40postgresql.org) > but there's also discussion of trying to make the TDEK unique without a > separate salt so I'm unsure. > > > > With a per-relation salt there is no need to include fixed attributes > (database, relfilenode, or tablespace) to ensure the derived key is unique > per relation. A long salt (32-bytes from /dev/urandom) alone guarantees > that uniqueness. Copying or moving files would then be possible by also > copying the salt. It does not need to be a salt per file on disk either, > one salt can be used for many files for the same relation by including the > fork number, type, or segment in the TDEK derivation (so each file on disk > for that relation ends up with a unique TDEK). 
> > If we can derive unique TDEK using (database, tablespace, relfilenode) > as info I think it's better to use it rather than using random salt > per relations since it doesn't require additional information we need > to store. As described in HKDF RFC[1], if the input key is already > present as a cryptographically strong key we can skip the extract part > where use a salt. > > [1] https://tools.ietf.org/html/rfc5869 Yes, a random salt is not required for security reasons. Any unique values mixed into the HKDF are fine and the derived keys will still be unique. The HKDF ensures that uniqueness. The separate salt allows you to disconnect the key derivation from the physical attributes of the file. The physical attributes (ex: database, tablespace, file node) are very convenient as they're unique and do not require additional storage. However, using them prevents copying or moving the encrypted files as one or more of them would be different at the destination (so the derived key would no longer decrypt the existing data). So you would have to decrypt / encrypt everything as part of a copy. If copying raw files without a decrypt/encrypt cycle is desired then the key derivation cannot include physical attributes (or per Bruce's note above, there would be no separate key derivation per relation). I thought it'd be a nice property to have as it limits the amount of code that needs to be crypto-aware (ex: copying a database or moving a table to a different tablespace would not change beyond ensuring the salt is also copied). Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
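A sketch of the salt-based derivation described above, assuming an HKDF-SHA-256 construction with the fork number mixed into the info field (the function and label names are illustrative, not from any patch):

```python
import hashlib
import hmac
import os

def derive_tdek(mdek: bytes, relation_salt: bytes, fork: int) -> bytes:
    # One random salt per relation; the fork number is mixed into the info
    # so each on-disk file for the relation gets its own key.
    prk = hmac.new(relation_salt, mdek, hashlib.sha256).digest()  # HKDF extract
    info = b"TDEK|fork=%d" % fork
    return hmac.new(prk, info + b"\x01", hashlib.sha256).digest()  # single-block expand

mdek = os.urandom(32)
salt = os.urandom(32)  # stored alongside the relation and copied with it
k_main = derive_tdek(mdek, salt, 0)  # main fork
k_fsm = derive_tdek(mdek, salt, 1)   # free space map fork
assert k_main != k_fsm
# Copying the file plus its salt reproduces the same key at the destination,
# with no physical attributes (database, tablespace, relfilenode) involved:
assert derive_tdek(mdek, salt, 0) == k_main
```

The design choice is exactly the trade-off described in the mail: the salt needs extra storage, but in exchange the derived key survives a raw file copy or tablespace move.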
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Tue, Jul 30, 2019 at 8:16 AM Bruce Momjian wrote: > On Tue, Jul 30, 2019 at 07:44:20AM -0400, Sehrope Sarkuni wrote: > > If each relation file has its own derived key, the derived TDEK for that > > relation file, then there is no issue with reusing the same IV = LSN || > Page > > Number. The TDEKs will be different so Key + IV will never collide. > > So, this email explains that we are considering not using the > relfilenode/tablepsace/database to create a derived key per relation, > but us the same key for all relaions because the IV will be unique per > page across all relations: > > > https://www.postgresql.org/message-id/20190729134442.2bxakegiqafxg...@momjian.us > > There is talk of using a derived key with a contant to make sure all > heap/index files use a different derived key than WAL, but I am not > sure. This is related to whether WAL IV and per-heap/index IV can > collide. > Ah, I read that to imply that derived keys were a must. Specifically this piece at the end: From Joe's email on 2019-07-13 18:41:34: >> Based on all of that I cannot find a requirement that we use more than one key per database. >> >> But I did find that files in an encrypted file system are encrypted with derived keys from a master key, and I view this as analogous to what we are doing. I read that as the "one key per database" is the MDEK. And read the part about derived keys as referring to separate derived keys for relations. Perhaps I misread and it was referring to different keys for WAL vs pages. > There are other emails in the thread that also discuss the topic. The > issue is that we add a lot of complexity to other parts of the system, > (e.g. pg_upgrade, CREATE DATABASE, moving relations between tablespaces) > to create a derived key, so we should make sure we need it before we do > it. > Yes it definitely complicates things both on the derivation and potential additional storage for the salts (they're small and fixed size, but you still need to put it somewhere). 
I think key rotation for TDEK will be impossible without some stored salt and per-relation derived key. It might not be needed in a first cut though as the "default salt" could be no salt or a place holder of all zeroes. Even if the rotation itself is out of scope for a first pass the potential to eventually add it should be there. Should keep in mind that because we do not have a MAC on the encrypted pages we'll need to know which derived key to use. We can't try multiple options to see which is correct as any key would "succeed" with garbage decrypted data. > > In general it's fine to use the same IV with different keys. Only reuse > of Key > > + IV is a problem and the entire set of possible counter values (IV + 0, > IV + > > 1, ...) generated with a key must be unique. That's also why we must > leave at > > least log2(PAGE_SIZE / AES_BLOCK_SIZE) bits at the end of the IV to be > filled > > in with 0, 1, 2, ... for each 16-byte AES-block on the page. If our > per-page IV > > prefix used any of those bits then the counter could overflow into the > next > > page's IV's range. > > Agreed. > > Attached is an updated patch that checks only main relation forks, which > I think are the only files we are going ot encrypt, and it has better > comments. > Okay that makes sense in the context of using a single key and relying on the LSN based IV to be unique. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Tue, Jul 30, 2019 at 8:16 AM Masahiko Sawada wrote: > On Mon, Jul 29, 2019 at 8:18 PM Sehrope Sarkuni > wrote: > > > > On Mon, Jul 29, 2019 at 6:42 AM Masahiko Sawada > wrote: > > > > An argument could be made to push that problem upstream, i.e. let the > > > > supplier of the passphrase deal with the indirection. You would still > > > > need to verify the supplied passphrase/key is correct via something > > > > like authenticating against a stored MAC. > > > > > > So do we need the key for MAC of passphrase/key in order to verify? > > > > Yes. Any 128 or 256-bit value is a valid AES key and any 16-byte input > > can be "decrypted" with it in both CTR and CBC mode, you'll just end > > up with garbage data if the key does not match. Verification of the > > key prior to usage (i.e. starting DB and encrypting/decrypting data) > > is a must as otherwise you'll end up with all kinds of corruption or > > data loss. > > > > Do you mean that we encrypt and store a 16 byte input with the correct > key to the disk, and then decrypt it with the user supplied key and > compare the result to the input data? > Yes but we don't compare via decryption of a known input. We instead compute a MAC of the encrypted master key using the user supplied key, and compare that against an expected MAC stored alongside the encrypted master key. 
The pseudo code would be something like:

// Read key text from user:
string raw_kek = read_from_user()

// Normalize it to a fixed size of 64-bytes
byte[64] kek = SHA512(SHA512(raw_kek))

// Split the 64-bytes into a separate encryption and MAC key
byte[32] user_encryption_key = kek.slice(0, 32)
byte[32] user_mac_key = kek.slice(32, 64)

// Read our saved MAC and encrypted master key
byte[80] mac_iv_encrypted_master_key = read_from_file()

// First 32-bytes is the MAC of the rest
byte[32] expected_mac = mac_iv_encrypted_master_key.slice(0, 32)

// Rest is a random IV + Encrypted master key
byte[48] iv_encrypted_master_key = mac_iv_encrypted_master_key.slice(32, 80)

// Compute the MAC with the user supplied key
byte[32] actual_mac = HMAC(user_mac_key, iv_encrypted_master_key)

// If it does not match then the user key is invalid
if (actual_mac != expected_mac) {
  print_err_and_exit("Bad user key!")
}

// Our MAC was correct
// ... so we know user supplied key is valid
// ... and we know our iv and encrypted_key are valid
byte[16] iv = iv_encrypted_master_key.slice(0, 16)
byte[32] encrypted_master_key = iv_encrypted_master_key.slice(16, 48)

// ... so we can use all three to decrypt the master key (MDEK)
byte[32] master_key = decrypt_aes_cbc(user_encryption_key, iv, encrypted_master_key)

> From a single user supplied passphrase you would derive the MDEK and > compute a MAC (either using the same key or via a separate derived > MDEK-MAC key). If the computed MAC matches against the previously > stored value then you know the MDEK is correct as well. > > You meant KEK, not MDEK? > If the KEK is incorrect then the MAC validation would fail and the decrypt would never be attempted. If the MAC matches then both the KEK (user supplied key) and MDEK ("master_key" in the pseudo code above) would be confirmed to be valid. So the MDEK is safe to use for deriving keys for encrypt / decrypt. 
I'm using the definitions for "KEK" and "MDEK" from Joe's mail https://www.postgresql.org/message-id/c878de71-a0c3-96b2-3e11-9ac2c35357c3%40joeconway.com Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
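The MAC-verification portion of that flow can be written as runnable Python using only the standard library (the final `decrypt_aes_cbc` step is omitted since it needs a cipher library; the passphrase normalization and key split follow the pseudo code above):

```python
import hashlib
import hmac

def split_user_key(raw_kek: str):
    # Normalize the passphrase to 64 bytes, then split it into a
    # 32-byte encryption key and a 32-byte MAC key.
    kek = hashlib.sha512(hashlib.sha512(raw_kek.encode()).digest()).digest()
    return kek[:32], kek[32:]

def mac_matches(mac_key: bytes, expected_mac: bytes, iv_encrypted_mdek: bytes) -> bool:
    actual = hmac.new(mac_key, iv_encrypted_mdek, hashlib.sha256).digest()
    # Constant-time compare; only on success would the decrypt be attempted.
    return hmac.compare_digest(actual, expected_mac)

enc_key, mac_key = split_user_key("correct horse battery staple")
blob = b"\x00" * 48  # stand-in for the stored IV + encrypted MDEK
good_mac = hmac.new(mac_key, blob, hashlib.sha256).digest()
assert mac_matches(mac_key, good_mac, blob)
wrong_mac_key = split_user_key("wrong passphrase")[1]
assert not mac_matches(wrong_mac_key, good_mac, blob)
```

A wrong passphrase fails the MAC check up front, so garbage is never decrypted into use, which is the whole point of verifying before decrypting.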
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 29, 2019 at 8:35 PM Bruce Momjian wrote: > On Sun, Jul 28, 2019 at 10:33:03PM -0400, Bruce Momjian wrote: > > I am thinking of writing some Assert() code that checks that all buffers > > using a single LSN are from the same relation (and therefore different > > page numbers). I would do it by creating a static array, clearing it on > > XLogBeginInsert(), adding to it for each XLogInsert(), then checking on > > PageSetLSN() that everything in the array is from the same file. Does > > that make sense? > > So, I started looking at how to implement the Assert checks and found > that Heikki has already added (in commit 2c03216d83) Assert checks to > avoid duplicate block numbers in WAL. I just added the attached patch > to check that all RelFileNodes are the same. > >From the patch: /* ! * The initialization vector (IV) is used for page-level ! * encryption. We use the LSN and page number as the IV, and IV ! * values must never be reused since it is insecure. It is safe ! * to use the LSN on multiple pages in the same relation since ! * the page number is part of the IV. It is unsafe to reuse the ! * LSN in different relations because the page number might be ! * the same, and hence the IV. Therefore, we check here that ! * we don't have WAL records for different relations using the ! * same LSN. ! */ If each relation file has its own derived key, the derived TDEK for that relation file, then there is no issue with reusing the same IV = LSN || Page Number. The TDEKs will be different so Key + IV will never collide. In general it's fine to use the same IV with different keys. Only reuse of Key + IV is a problem and the entire set of possible counter values (IV + 0, IV + 1, ...) generated with a key must be unique. That's also why we must leave at least log2(PAGE_SIZE / AES_BLOCK_SIZE) bits at the end of the IV to be filled in with 0, 1, 2, ... for each 16-byte AES-block on the page. 
If our per-page IV prefix used any of those bits then the counter could overflow into the next page's IV's range. > I ran the regression tests with asserts on and got no failures, so I > think we are good. It's not strictly required but it also doesn't hurt that LSN is unique per-relation so that's still good news! Might be useful for something down the road like a separate stream of MACs computed per-LSN. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
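The IV layout under discussion can be sketched with `struct.pack` standing in for the C struct (field order and widths follow the LSN || page number || counter scheme above; the names are illustrative):

```python
import struct

def page_iv(lsn: int, page_no: int, aes_block: int) -> bytes:
    # 8-byte LSN || 4-byte page number || 4-byte per-page block counter.
    # The counter ticks 0, 1, 2, ... for each 16-byte AES block on the page,
    # so the per-page prefix (LSN + page number) must never spill into it.
    return struct.pack(">QII", lsn, page_no, aes_block)

assert len(page_iv(1, 0, 0)) == 16  # exactly one AES block wide
# The same LSN on two pages of one relation still yields distinct IVs:
assert page_iv(1000, 0, 0) != page_iv(1000, 1, 0)
# An 8 kB page divided into 16-byte AES blocks needs 512 counter values:
assert 8192 // 16 == 512
```

Since 512 counter values need 9 bits, reserving a full 32-bit field for the counter leaves ample headroom for larger page sizes without risking overflow into the adjacent page-number field.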
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 29, 2019 at 4:15 PM Alvaro Herrera wrote: > On 2019-Jul-27, Sehrope Sarkuni wrote: > > > Given the non-cryptographic nature of CRC and its 16-bit size, I'd > > round down the malicious tamper detection it provides to zero. At best > > it catches random disk errors so might as well keep it in plain text > > and checkable offline. > > But what attack are we protecting against? We fear that somebody will > steal a disk or a backup. We don't fear that they will *write* data. > The CRC is there to protect against data corruption. So whether or not > the CRC protects against malicious tampering is beside the point. > That was in response to using an encrypted CRC for tamper detection. I agree that it does not provide meaningful protection so there is no point in adding complexity to use it for that. I agree it's better to leave the CRC as-is for detecting corruption which also has the advantage of playing nice with existing checksum tooling. > If we were trying to protect against an attacker having access to > *writing* data in the production server, this encryption scheme is > useless: they could just as well read unencrypted data from shared > buffers anyway. > The attack situation is someone being able to modify pages at the storage tier. They cannot necessarily read server memory or the encryption key, but they could make changes to existing data or an existing backup that would be subsequently read by the server. Dealing with that is way out of scope but similar to the replica promotion I think it should be kept track of and documented. > I think trying to protect against malicious data tampering is a second > step *after* this one is done. > +1 Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 29, 2019 at 4:10 PM Alvaro Herrera wrote: > On 2019-Jul-27, Bruce Momjian wrote: > > > I think using LSN and page number, we will _never_ reuse the IV, except > > for cases like promoting two standbys, which I think we have to document > > as an insecure practice. > > Actually, why is it an insecure practice? If you promote two standbys, > then the encrypted pages are the same pages, so it's not two different > messages with the same key/IV -- they're still *one* message. And as > soon as they start getting queries, they will most likely diverge > because the LSNs of records after the promotion will (most likely) no > longer match. It takes one different WAL record length for the > "encryption histories" to diverge completely ... > You could have a sequence of post-promotion events like:

# Replica 1
LSN=t+0 Operation A
LSN=t+1 Operation B
...
LSN=t+n Operation C

# Replica 2
LSN=t+0 Operation X
LSN=t+1 Operation Y
...
LSN=t+n Operation Z

If the LSN and modified page numbers of C and Z are the same ... and the net effect of Z is known (ex: setting a bunch of bytes on the row to zero) ... and you can read the encrypted pages of both replicas (ex: have access to the encrypted storage tier but not necessarily the live server) ... then you can XOR the encrypted pages to get the plain text for the bytes after operation C. Yes, it's not likely and yes it has a lot of "if..." involved, but it is possible. I don't think this will be an issue in practice, but it should be documented. Otherwise, it's not unreasonable for someone to expect that a promoted replica would be using new keys for everything after each promotion. Encryption for WAL can avoid this type of problem entirely by generating a new random salt and adding a "Use new salt XYZ for WDEK going forward" record. The two replicas would generate different salts so all subsequent encrypted WAL data would be different (even for the exact same records). 
Unfortunately, that doesn't work for pages without a lot more complexity to keep track of which key version to use based upon the LSN. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
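The XOR leak from the two-promoted-replicas scenario can be demonstrated with a toy stream cipher (a SHA-256-based keystream stands in for AES-CTR; the reuse property shown is identical):

```python
import hashlib

def toy_ctr(key: bytes, iv: bytes, data: bytes) -> bytes:
    # Toy CTR-style stream cipher (NOT real AES-CTR): a counter-driven
    # SHA-256 keystream is XORed into the data.
    ks = b""
    ctr = 0
    while len(ks) < len(data):
        ks += hashlib.sha256(key + iv + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return bytes(a ^ b for a, b in zip(data, ks))

key, iv = b"k" * 32, b"i" * 16           # same key and same LSN/page-derived IV
c1 = toy_ctr(key, iv, b"replica one page")  # operation C on replica 1
c2 = toy_ctr(key, iv, b"replica two page")  # operation Z on replica 2
xor = bytes(a ^ b for a, b in zip(c1, c2))
# The shared keystream cancels: XOR of ciphertexts equals XOR of plaintexts,
# so knowing either plaintext (ex: known zeroed bytes) reveals the other.
assert xor == bytes(a ^ b for a, b in zip(b"replica one page", b"replica two page"))
```

This is exactly why the divergence has to be guaranteed (new salt per promotion) rather than merely likely.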
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 29, 2019 at 9:44 AM Bruce Momjian wrote: > > Checking that all buffers using a single LSN are from the same > > relation would be a good idea but I think it's hard to test it and > > regard the test result as okay. Even if we passed 'make checkworld', > > it might still be possible to happen. And even assertion failures > > Yes, the problem is that if you embed the relfilenode or tablespace or > database in the encryption IV, you then need to then make sure you > re-encrypt any files that move between these. I am hesitant to do that > since it then requires these workarounds for encryption going forward. > We know that most people will not be using encryption, so that will not > be well tested either. For pg_upgrade, I used a minimal-impact > approach, and it has allowed dramatic changes in our code without > requiring changes and retesting of pg_upgrade. > Will there be a per-relation salt stored in a separate file? I saw it mentioned in a few places (most recently https://www.postgresql.org/message-id/aa386c3f-fb89-60af-c7a3-9263a633ca1a%40postgresql.org) but there's also discussion of trying to make the TDEK unique without a separate salt so I'm unsure. With a per-relation salt there is no need to include fixed attributes (database, relfilenode, or tablespace) to ensure the derived key is unique per relation. A long salt (32-bytes from /dev/urandom) alone guarantees that uniqueness. Copying or moving files would then be possible by also copying the salt. It does not need to be a salt per file on disk either, one salt can be used for many files for the same relation by including the fork number, type, or segment in the TDEK derivation (so each file on disk for that relation ends up with a unique TDEK). There's the usual gotchas of copying encrypted data, i.e. it's exactly the same so clearly they're equal. But any subsequent changes would have a different LSN and encrypt differently going forward. 
If the main use cases are copying an entire database or moving a tablespace, having that be simpler/faster seems like a good idea. It could be a known limitation like the promoting of multiple replicas. Plus with a key rotation tool anyone that wants everything re-encrypted could run one after the copy. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 29, 2019 at 6:42 AM Masahiko Sawada wrote: > > An argument could be made to push that problem upstream, i.e. let the > > supplier of the passphrase deal with the indirection. You would still > > need to verify the supplied passphrase/key is correct via something > > like authenticating against a stored MAC. > > So do we need the key for MAC of passphrase/key in order to verify? Yes. Any 128 or 256-bit value is a valid AES key and any 16-byte input can be "decrypted" with it in both CTR and CBC mode, you'll just end up with garbage data if the key does not match. Verification of the key prior to usage (i.e. starting DB and encrypting/decrypting data) is a must as otherwise you'll end up with all kinds of corruption or data loss. From a single user supplied passphrase you would derive the MDEK and compute a MAC (either using the same key or via a separate derived MDEK-MAC key). If the computed MAC matches against the previously stored value then you know the MDEK is correct as well. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 29, 2019 at 4:39 AM Masahiko Sawada wrote: > After more thoughts, I'm confused why we need to have MDEK. We can use > KEK derived from passphrase and TDEK and WDEK that are derived from > KEK. That way, we don't need store any key in database file. What is > the advantage of 3-tier key architecture? The separate MDEK serves a couple of purposes:

1. Allows for rotating the passphrase without actually changing any of the downstream derived keys.
2. Verification that the passphrase itself is correct by checking if it can unlock and authenticate (via a MAC) the MDEK.
3. Ensures it's generated from a strong random source (ex: /dev/urandom).

If the MDEK was directly derived via a deterministic function of the passphrase, then that passphrase could never change, as all your derived keys would also change (and thus could no longer decrypt the existing data). The encrypted MDEK provides a level of indirection for passphrase rotation. An argument could be made to push that problem upstream, i.e. let the supplier of the passphrase deal with the indirection. You would still need to verify the supplied passphrase/key is correct via something like authenticating against a stored MAC. If you're going to do that, might as well directly support decrypting and managing your own MDEK. That also lets you ensure it was properly generated via a strong random source. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
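A toy sketch of that indirection (the XOR-pad "wrap" here stands in for real AES encryption of the MDEK; the point is only that rotating the passphrase re-wraps the MDEK without changing any downstream derived key):

```python
import hashlib
import hmac
import os

def wrap(kek: bytes, mdek: bytes) -> bytes:
    # Toy wrap: XOR with a KEK-derived pad (stand-in for AES; XOR is its own inverse).
    pad = hashlib.sha512(kek).digest()[:32]
    return bytes(a ^ b for a, b in zip(mdek, pad))

unwrap = wrap  # XOR undoes itself

mdek = os.urandom(32)  # generated once from a strong random source
old_kek = hashlib.sha256(b"old passphrase").digest()
blob = wrap(old_kek, mdek)  # what gets stored on disk

tdek_before = hmac.new(mdek, b"PAGE", hashlib.sha256).digest()

# Passphrase rotation: unwrap with the old KEK, rewrap with the new one.
new_kek = hashlib.sha256(b"new passphrase").digest()
blob = wrap(new_kek, unwrap(old_kek, blob))

tdek_after = hmac.new(unwrap(new_kek, blob), b"PAGE", hashlib.sha256).digest()
assert tdek_before == tdek_after  # derived keys, and thus the data, never change
```

Without the MDEK layer, the derived keys would be a function of the passphrase itself, and rotating it would require re-encrypting everything.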
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Sat, Jul 27, 2019 at 1:32 PM Bruce Momjian wrote: > Uh, I listed the three options for the CRC and gave the benefits of > each: > > > https://www.postgresql.org/message-id/20190725200343.xo4dcjm5azrfn...@momjian.us > > Obviously I was not clear on the benefits. To quote: > > 1. compute CRC and then encrypt everything > 3. encrypt and then CRC, and store the CRC encrypted > > Numbers 1 & 3 give us tampering detection, though with the CRC being so > small, it isn't totally secure. > > > Are you worried about an attacker forging the page checksum by > > installing another encrypted page that gives the same checksum? I'm not > > sure how that attack works ... I mean why can the attacker install > > arbitrary pages? > > Well, with #2 > > 2 encrypt and then CRC, and store the CRC unchanged > > you can modify the page, even small parts, and just replace the CRC to > match your changes. In #1 and #3, you would get a CRC error in almost > all cases since you have no way of setting the decrypted CRC without > knowing the key. You can change the encrypted CRC, but the odds that > the decrypted one would match the page is very slim. Regarding #1 and #3, with CTR mode you do not need to know the key to make changes to the CRC. Flipping bits of the encrypted CRC would flip the same bits of the decrypted one. This was one of the issues with the older WiFi encryption standard WEP[1] which used RC4 + CRC32. It's not the exact same usage pattern, but I wouldn't be surprised if there is a way to make in place updates and matching CRC32 changes even if it's encrypted. Given the non-cryptographic nature of CRC and its 16-bit size, I'd round down the malicious tamper detection it provides to zero. At best it catches random disk errors so might as well keep it in plain text and checkable offline. More generally, without a cryptographic MAC I don't think it's possible to provide any meaningful malicious tamper detection. 
And even that would have to be off-page to deal with page replay (which I think is out of scope). [1]: https://en.wikipedia.org/wiki/CRC-32#Data_integrity Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
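The CTR bit-flipping plus CRC patching described above can be shown with a toy keystream (SHA-256 standing in for AES-CTR, CRC32 standing in for the page checksum; both adjustments require no knowledge of the key, only of the XOR difference the attacker wants to inject):

```python
import hashlib
import zlib

def toy_ctr(key: bytes, iv: bytes, data: bytes) -> bytes:
    # Toy CTR-style stream cipher (NOT real AES-CTR); XORs a keystream into data.
    ks = b""
    ctr = 0
    while len(ks) < len(data):
        ks += hashlib.sha256(key + iv + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return bytes(a ^ b for a, b in zip(data, ks))

key, iv = b"k" * 32, b"i" * 16
page = b"balance= 100;pad"
ct = toy_ctr(key, iv, page)

# Without the key, flip the ciphertext bits that turn '1' into '9':
delta = bytearray(len(page))
delta[9] = ord("1") ^ ord("9")
tampered_ct = bytes(c ^ d for c, d in zip(ct, delta))
assert toy_ctr(key, iv, tampered_ct) == b"balance= 900;pad"

# CRC32 is affine over XOR, so the matching checksum delta is also
# computable from `delta` alone, with no key material:
patch = zlib.crc32(bytes(delta)) ^ zlib.crc32(bytes(len(page)))
assert zlib.crc32(page) ^ patch == zlib.crc32(toy_ctr(key, iv, tampered_ct))
```

The same affine CRC property is what made targeted in-place modifications practical against WEP, which is why only a keyed MAC, not a CRC, gives real tamper detection.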
Re: fsync error handling in pg_receivewal, pg_recvlogical
While reviewing this patch I read through some of the other fsync callsites and noticed this typo (walkdir is in file_utils.c, not initdb.c) too: diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 315c74c745..9b79df2d7f 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -3208,7 +3208,7 @@ SyncDataDirectory(void) * * Errors are reported at level elevel, which might be ERROR or less. * - * See also walkdir in initdb.c, which is a frontend version of this logic. + * See also walkdir in file_utils.c, which is a frontend version of this logic. */ static void walkdir(const char *path, Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: fsync error handling in pg_receivewal, pg_recvlogical
Hi, Tried out this patch and it applies, compiles, and passes check-world. Also flipped things around in pg_recvlogical.c to exit-on-success to ensure it's actually being called and that worked too. Outside of a more complicated harness that simulates fsync errors, I'm not sure how else to test this further. I did some searching and found a FUSE-based filesystem that looks interesting: CharybdeFS[1]. Rather than being fixed at mount time, it has a client/server interface so you can change the handling of syscalls on the fly[2]. For example you can error out fsync calls halfway through a test rather than always or randomly. Haven't tried it out but leaving it here as it seems relevant. [1]: https://github.com/scylladb/charybdefs [2]: https://www.scylladb.com/2016/05/02/fault-injection-filesystem-cookbook/ On Wed, Jun 26, 2019 at 12:11 AM Michael Paquier wrote: > Why using a different error code. Using EXIT_FAILURE is a more common > practice in the in-core binaries. The patch looks fine to me except > that, that's a good first cut. > An error code specific to fsync issues could help with tests as the harness could check it to ensure things died for the right reasons. With a generic "messed up fsync" harness you might even be able to run some existing tests that would otherwise pass and check for the fsync-specific exit code. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Thu, Jul 25, 2019 at 8:50 PM Bruce Momjian wrote: > On Thu, Jul 25, 2019 at 08:44:40PM -0400, Sehrope Sarkuni wrote: > > You can still use CTR mode and include those to make the key + IV unique > by > > adding them to the derived key rather than the IV. > > > > The IV per-page would still be LSN + page-number (with the block number > added > > as it's evaluated across the page) and the relfilenode, heap/index, > database, > > and anything else to make it unique can be included in the HKDF to > create the > > per-file derived key. > > I thought if we didn't have to hash the stuff together we would be less > likely to get collisions with the IV. > IV creation does not use any hashing and would never have collisions with the same key, as it's LSN + page + block (concatenation). The derived keys would also not have collisions as the HKDF prevents that. Deriving two matching keys with different inputs has the same chance as randomly generating matching HMACs (effectively nil with something like HMAC-SHA-256). So there wouldn't be any reuse of the same key + IV. Even if two different files are encrypted with the same LSN + page, the total operation (key + IV) would be different as they'd be using different derived keys. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Thu, Jul 25, 2019 at 7:51 PM Bruce Momjian wrote: > Looking at the bits we have, the IV for AES is 16 bytes. Since we know > we have to use LSN (to change the IV for each page write), and the page > number (so WAL updates that change multiple pages with the same LSN use > different IVs), that uses 12 bytes: > > LSN 8 bytes > page-number 4 bytes > > That leaves 4 bytes unused. If we use CTR, we need 11 bits for the > counter to support 32k pages sizes (per Sehrope Sarkuni), and we can use > the remaining 5 bits as constants to indicate heap, index, or WAL. > (Technically, since we are not encrypting the first 16 bytes, we could > use one less bit for the counter.) If we also use relfilenode, that is > 4 bytes, so we have no bits for the heap/index/WAL constant, and no > space for the CTR counter, meaning we would have to use CBC mode. > You can still use CTR mode and include those to make the key + IV unique by adding them to the derived key rather than the IV. The IV per-page would still be LSN + page-number (with the block number added as it's evaluated across the page) and the relfilenode, heap/index, database, and anything else to make it unique can be included in the HKDF to create the per-file derived key. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Sat, Jul 20, 2019 at 1:30 PM Tomas Vondra wrote: > Forbid checksums? I don't see how that could be acceptable. We either have > to accept the limitations of the current design (having to decrypt > everything before checking the checksums) or change the design. > > I personally think we should do the latter - not just because of this > "decrypt-then-verify" issue, but consider how much work we've done to > allow enabling checksums on-line (it's still not there, but it's likely > doable in PG13). How are we going to do that with encryption? ISTM we > should design it so that we can enable encryption on-line too - maybe not > in v1, but it should be possible. So how are we going to do that? With > checksums it's (fairly) easy because we can "not verify" the page before > we know all pages have a checksum, but with encryption that's not > possible. And if the whole page is encrypted, then what? > > Of course, maybe we don't need such capability for the use-case we're > trying to solve with encryption. I can imagine that someone is running a > large system, has issues with data corruption, and decides to enable > checksums to remedy that. Maybe there's no such scenario in the privacy > case? But we can probably come up with scenarios where a new company > policy forces people to enable encryption on all systems, or something > like that. > > That being said, I don't know how to solve this, but it seems to me that > any system where we can't easily decide whether a page is encrypted or not > (because everything including the page header) is encrypted has this > exact issue. Maybe we could keep some part of the header unencrypted > (likely an information leak, does not solve decrypt-then-verify). Or maybe > we need to store some additional information on each page (which breaks > on-disk format). How about storing the CRC of the encrypted pages? 
It would not leak any additional information and serves the same purpose as a non-encrypted one, namely I/O corruption detection. I took a look at pg_checksum and, besides checking for empty pages, the checksum validation path does not interpret any other fields to calculate the checksum. I think even the offline checksum enabling path looks like it may work out of the box. Checksums of encrypted data are not a replacement for a MAC, but this would allow that validation to run without any knowledge of keys.

Related, I think CTR mode should be considered for pages. It has performance advantages at 8K data sizes but, even better, allows arbitrary bytes of the ciphertext to be replaced. For example, after encrypting a block you can replace the two checksum bytes with the CRC of the ciphertext, vs. CBC mode where that would cause corruption to cascade forward. The same could be used for leaving things like pd_pagesize_version in plaintext at its current offset. For anything deemed non-sensitive, having it readable without having to decrypt the page is useful.

It does not have to be full bytes either. CTR mode operates as a stream of bits, so it's possible to replace nibbles or even individual bits. It can be something as small as one bit for an "is_encrypted" flag, or a handful of bits used to infer a derived key. For example, with 2 bits you could have 00 represent unencrypted, 01/10 represent old/new key, and 11 be future use. Something like that could facilitate online key rotation.

Regards,
-- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
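The "replace arbitrary ciphertext bytes under CTR" property is easy to demonstrate. In the sketch below, a SHA-256 counter keystream stands in for AES-CTR (Python's stdlib has no AES, and the property depends only on the XOR structure); the 4-byte checksum slot at offset 8 is an arbitrary illustrative choice, not PostgreSQL's actual page layout.

```python
# Demonstrates: under a CTR-style (XOR keystream) cipher, overwriting a
# few ciphertext bytes with a CRC of the encrypted page corrupts only
# those bytes on decryption -- the rest of the page decrypts intact.
import hashlib
import struct
import zlib


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + struct.pack(">Q", counter)).digest()
        counter += 1
    return out[:length]


def ctr_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    ks = keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))


key, nonce = b"k" * 32, b"n" * 16
page = bytes(range(256)) * 32                # an 8K "page"
ct = bytearray(ctr_xor(key, nonce, page))

# Stamp a CRC of the ciphertext into an illustrative 4-byte slot,
# computing it with that slot zeroed (as pg checksums exclude the field).
ct[8:12] = b"\x00" * 4
crc = zlib.crc32(bytes(ct)) & 0xFFFFFFFF
ct[8:12] = struct.pack(">I", crc)

# Offline verification needs no keys: zero the slot, recompute, compare.
stored = struct.unpack(">I", ct[8:12])[0]
check = bytearray(ct)
check[8:12] = b"\x00" * 4
assert zlib.crc32(bytes(check)) & 0xFFFFFFFF == stored

# Decrypt, then ignore (zero) the bytes that were overwritten post-encryption.
pt = bytearray(ctr_xor(key, nonce, bytes(ct)))
pt[8:12] = b"\x00" * 4
assert pt[:8] == page[:8] and pt[12:] == page[12:]   # rest of page intact
```

Under CBC the same overwrite would garble every block from the stamped bytes onward, which is why this trick is CTR-specific.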
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Hi,

Some more thoughts on CBC vs CTR modes. There are a number of advantages to using CTR mode for page encryption. CTR encryption can be fully parallelized, whereas CBC can only be parallelized for decryption. While both can use AES-specific hardware such as AES-NI, CTR modes can go a step further and use vectorized instructions. On an i7-8559U (with AES-NI) I get a 4x speed improvement for CTR-based modes vs CBC when run on 8K of data:

# openssl speed -evp ${cipher}
type            16 bytes      64 bytes     256 bytes    1024 bytes   8192 bytes   16384 bytes
aes-128-cbc   1024361.51k  1521249.60k  1562033.41k  1571663.87k  1574537.90k  1575512.75k
aes-128-ctr    696866.85k  2214441.86k  4364903.85k  5896221.35k  6559735.81k  6619594.75k
aes-128-gcm    642758.92k  1638619.09k  3212068.27k  5085193.22k  6366035.97k  6474006.53k
aes-256-cbc    940906.25k  1114628.44k  1131255.13k  1138385.92k  1140258.13k  1143592.28k
aes-256-ctr    582161.82k  1896409.32k  3216926.12k  4249708.20k  4680299.86k  4706375.00k
aes-256-gcm    553513.89k  1532556.16k  2705510.57k  3931744.94k  4615812.44k  4673093.63k

For relation data, where the encryption is going to be per page, there's flexibility in how the CTR nonce (IV + counter) is generated. With an 8K page, the counter need only go up to 512 for each page (8192 bytes per page / 16 bytes per AES block). That requires 9 bits for the counter. Rounding that up to 16 bits allows for wider pages and still uses only two bytes of the nonce while ensuring it is unique per AES block. The remaining 14 bytes would be populated with some other data that is guaranteed unique per page-write, to allow encryption via the same per-relation-file derived key. From what I gather, the LSN is a candidate, though it'd have to be stored in plaintext for decryption. What's important is that writing two pages (either different locations or the same page back again) never reuses the same nonce with the same key. Using the same nonce with a different key is fine.
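The 16-byte nonce layout described above (a 2-byte AES-block counter plus 14 bytes unique per page-write) could be packed as follows. Padding the 8-byte LSN with 6 reserved zero bytes is an illustrative assumption, not a settled format.

```python
# Sketch of a per-block CTR nonce: 8-byte LSN + 6 reserved bytes +
# 2-byte block counter. An 8K page needs the counter to reach only
# 512 (8192 / 16), i.e. 9 bits, rounded up here to a full 16.
import struct

AES_BLOCK = 16
PAGE_SIZE = 8192


def block_nonce(lsn: int, block_index: int) -> bytes:
    assert 0 <= block_index < PAGE_SIZE // AES_BLOCK   # 0..511 fits in 9 bits
    return struct.pack(">Q", lsn) + b"\x00" * 6 + struct.pack(">H", block_index)


# Every AES block of one page write gets a distinct 16-byte nonce.
nonces = {block_nonce(0x01A2B3C4, i) for i in range(PAGE_SIZE // AES_BLOCK)}
assert len(nonces) == 512 and all(len(n) == 16 for n in nonces)
```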
With any of these schemes the same inputs will generate the same outputs. With CTR mode for WAL this would be an issue if the same key and deterministic nonce (e.g. LSN + offset) is reused in multiple places. That does not have to be within the same cluster either. For example, if two replicas are promoted from the same backup with the same master key, they would generate the same WAL CTR stream, reusing the key/nonce pair. Ditto for starting off with a master key and deriving per-relation keys in a cloned installation off some deterministic attribute such as oid.

This can be avoided by deriving new keys per file (not just per relation) from a random salt. It'd be stored out of band and combined with the master key to derive the specific key used for that CTR stream. If there's a desire for supporting multiple ciphers or key sizes, that could be stored alongside the salt. Perhaps use the same location, or the lack of it, to indicate "not encrypted" as well.

Per-file salts and derived keys would facilitate re-keying a table piecemeal, file by file, by generating a new salt/derived-key, encrypting a copy of the decrypted file, and doing an atomic rename. The file's contents would change, but its length and any references to pages or byte offsets would stay valid. (I think this would work for CBC modes too, as there's nothing CTR-specific about it.)

What I'm not sure of is how to handle randomizing the relation file IV in a cloned database. Until the key for a relation file or segment is rotated, it'd have the same deterministic IV generated as its source, as the LSN would continue from the same point. One idea: with 128 bits for the IV, one could have 64 bits for the LSN, 16 bits for the AES-block counter, and the remaining 48 bits randomized, though you'd need to store those 48 bits somewhere per page (basically a salt per page). That'd give some protection from the clone's new data being encrypted with the same stream as the parent's.
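The per-file salt and re-key-by-atomic-rename scheme could be sketched like this. The single-HMAC key derivation, the in-memory dict standing in for out-of-band salt storage, and the pluggable encrypt/decrypt callbacks are all placeholders for illustration.

```python
# Sketch: each relation file has a random salt stored out of band; the
# working key is derived from master key + salt; re-keying re-encrypts
# a copy under a fresh salt and atomically renames it into place, so
# the file's length and page offsets never change.
import hashlib
import hmac
import os
import secrets
import tempfile


def derive_file_key(master_key: bytes, salt: bytes) -> bytes:
    return hmac.new(salt, master_key, hashlib.sha256).digest()


def rekey_file(path: str, master_key: bytes, salts: dict,
               decrypt, encrypt) -> None:
    """Re-encrypt `path` under a fresh salt, then atomically replace it."""
    with open(path, "rb") as f:
        plaintext = decrypt(derive_file_key(master_key, salts[path]), f.read())
    new_salt = secrets.token_bytes(16)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        f.write(encrypt(derive_file_key(master_key, new_salt), plaintext))
    os.replace(tmp, path)      # atomic rename: length/offsets stay valid
    salts[path] = new_salt     # would be persisted out of band in reality
```

Because only the salt changes, readers that track pages by offset are unaffected; they just need to pick up the new salt before decrypting.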
Another option would be to track ranges of LSNs and have a centralized list of 48-bit randomized salts. That would remove the need for an additional salt per page, though you'd have to do a lookup on that shared list to figure out which one to use.

CTR mode is definitely more complicated than a pure random-IV + CBC, but with any deterministic generation of IVs for CBC mode you're going to have some of these same problems as well.

Regarding CRCs, CTR mode has the advantage of not destroying the rest of the stream when replacing the CRC bytes. With CBC mode any change would cascade and corrupt the rest of the data downstream from that block. With CTR mode you can overwrite the CRC's location with the CRC, or a truncated MAC of the encrypted data, as each byte is encrypted separately. At decryption time you simply ignore the decrypted output of those bytes and zero them out again. A CRC of encrypted data (but not a partial MAC) could be checked offline without access to the key.

Regards,
-- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
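The "ranges of LSNs, centralized list of salts" option above amounts to a sorted directory searched by binary search. A hypothetical sketch (names and structure are illustrative only):

```python
# Sketch: a sorted list of (start_lsn, salt) entries; lookup(lsn)
# returns the salt in effect for the range covering that LSN.
import bisect


class SaltDirectory:
    def __init__(self):
        self._starts = []   # sorted range-start LSNs
        self._salts = []    # salt in effect from the matching start LSN

    def add(self, start_lsn: int, salt: bytes) -> None:
        i = bisect.bisect_left(self._starts, start_lsn)
        self._starts.insert(i, start_lsn)
        self._salts.insert(i, salt)

    def lookup(self, lsn: int) -> bytes:
        # Rightmost entry whose start LSN is <= lsn.
        i = bisect.bisect_right(self._starts, lsn) - 1
        if i < 0:
            raise KeyError("LSN precedes first salted range")
        return self._salts[i]


d = SaltDirectory()
d.add(0, b"salt-A")
d.add(1000, b"salt-B")
assert d.lookup(999) == b"salt-A" and d.lookup(1000) == b"salt-B"
```

The lookup is O(log n) in the number of ranges, which is the cost traded for dropping the 48-bit per-page salt.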
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Thu, Jul 11, 2019 at 9:05 PM Bruce Momjian wrote:
>
> On Thu, Jul 11, 2019 at 08:41:52PM -0400, Joe Conway wrote:
> > I vote for AES 256 rather than 128.
>
> Why? This page seems to think 128 is sufficient:
>
> https://crypto.stackexchange.com/questions/20/what-are-the-practical-differences-between-256-bit-192-bit-and-128-bit-aes-enc
>
>     For practical purposes, 128-bit keys are sufficient to ensure
>     security. The larger key sizes exist mostly to satisfy some US military
>     regulations which call for the existence of several distinct "security
>     levels", regardless of whether breaking the lowest level is already far
>     beyond existing technology.
>
> We might need to run some benchmarks to determine the overhead of going
> to AES256, because I am unclear of the security value.

If the algorithm and key size are not going to be configurable, then better to lean toward the larger size, especially given the desire for future proofing against standards evolution and the potential for the encrypted data to be very long lived. NIST recommends AES-128 or higher, but there are other publications that recommend AES-256 for long-term usage:

NIST - 2019 - Recommends AES-128, AES-192, or AES-256.
https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-131Ar2.pdf

NSA - 2016 - Recommends AES-256 for future quantum resistance.
https://apps.nsa.gov/iaarchive/library/ia-guidance/ia-solutions-for-classified/algorithm-guidance/cnsa-suite-and-quantum-computing-faq.cfm

ECRYPT - 2015 - Recommends AES-256 for future quantum resistance.
https://www.ecrypt.eu.org/csa/documents/PQC-whitepaper.pdf

ECRYPT - 2018 - Recommends AES-256 for long-term use.
https://www.ecrypt.eu.org/csa/documents/D5.4-FinalAlgKeySizeProt.pdf

Regards,
-- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Refactoring syslogger piping to simplify adding new log destinations
Hi,

While working on adding a new log_destination I noticed that the syslogger piping would need to be updated. At the moment both ends only handle stderr/csvlog, as the pipe message header has a char "is_last" that is either t/f (stderr last, stderr partial) or T/F (csvlog last, csvlog partial).

A couple of approaches came to mind:

1. Use additional pairs of chars for each additional destination (e.g. x/X, y/Y, ...) and mimic the logic of csvlog.

2. Repurpose the char "is_last" as a bitmap of the log destination, with the highest-order bit indicating whether it's the last chunk.

3. Add a separate field "dest" for the log destination and leave "is_last" as a t/f indicating whether it's the last chunk.

Attached are patches for each approach (fun exercise!). Also attached is a basic TAP test to invoke the csvlog destination. It's a clone of the pg_ctl log rotation test that looks for .csv logs. If there's interest in the test, I was thinking of expanding it a bit to include "big" output that would span multiple messages, to exercise the partial/combining path.

My thoughts on the approaches:

#1 doesn't change the header types or size, but seems ugly as it leads to new pairs of constants and logic in multiple places. In particular, both the send and receive ends have to encode and decode the destination.

#2 is cleaner, as there's a logical separation of the dest fields and no need for new constant pairs when adding new destinations. We would also need to ensure new LOG_DESTINATION_xyz constants do not use that last bit (there are already four now, so room for three more).

#3 leads to the cleanest code, though you lose 4 bytes of max data size per chunk.

Which would be preferable? I'd like to validate the approach, as the new log destination would be built atop it. I'm leaning toward #3, though if the 4-byte loss is a deal breaker then #2.

Regards,
-- Sehrope Sarkuni Founder & CEO | JackDB, Inc.
| https://www.jackdb.com/

From 9ec38a587b0c2645bc9fd73398c1debdf9fa962b Mon Sep 17 00:00:00 2001
From: Sehrope Sarkuni
Date: Wed, 10 Jul 2019 09:27:44 -0400
Subject: [PATCH 1/2] Add basic test for csvlog

---
 src/bin/pg_ctl/t/005_csvlog.pl | 94 ++
 1 file changed, 94 insertions(+)
 create mode 100644 src/bin/pg_ctl/t/005_csvlog.pl

diff --git a/src/bin/pg_ctl/t/005_csvlog.pl b/src/bin/pg_ctl/t/005_csvlog.pl
new file mode 100644
index 00..b74c373bb6
--- /dev/null
+++ b/src/bin/pg_ctl/t/005_csvlog.pl
@@ -0,0 +1,94 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 4;
+use Time::HiRes qw(usleep);
+
+# Set up node with logging collector
+my $node = get_new_node('primary');
+$node->init();
+$node->append_conf(
+	'postgresql.conf', qq(
+logging_collector = on
+lc_messages = 'C'
+log_destination = 'csvlog'
+));
+
+$node->start();
+
+# Verify that log output gets to the file
+
+$node->psql('postgres', 'SELECT 1/0');
+
+my $current_logfiles = slurp_file($node->data_dir . '/current_logfiles');
+
+note "current_logfiles = $current_logfiles";
+
+like(
+	$current_logfiles,
+	qr|^csvlog log/postgresql-.*csv$|,
+	'current_logfiles is sane');
+
+my $lfname = $current_logfiles;
+$lfname =~ s/^csvlog //;
+chomp $lfname;
+
+# might need to retry if logging collector process is slow...
+my $max_attempts = 180 * 10;
+
+my $first_logfile;
+for (my $attempts = 0; $attempts < $max_attempts; $attempts++)
+{
+	$first_logfile = slurp_file($node->data_dir . '/' . $lfname);
+	last if $first_logfile =~ m/division by zero/;
+	usleep(100_000);
+}
+
+like($first_logfile, qr/division by zero/, 'found expected log file content');
+
+# Sleep 2 seconds and ask for log rotation; this should result in
+# output into a different log file name.
+sleep(2);
+$node->logrotate();
+
+# pg_ctl logrotate doesn't wait for rotation request to be completed.
+# Allow a bit of time for it to happen.
+my $new_current_logfiles;
+for (my $attempts = 0; $attempts < $max_attempts; $attempts++)
+{
+	$new_current_logfiles = slurp_file($node->data_dir . '/current_logfiles');
+	last if $new_current_logfiles ne $current_logfiles;
+	usleep(100_000);
+}
+
+note "now current_logfiles = $new_current_logfiles";
+
+like(
+	$new_current_logfiles,
+	qr|^csvlog log/postgresql-.*csv$|,
+	'new current_logfiles is sane');
+
+$lfname = $new_current_logfiles;
+$lfname =~ s/^csvlog //;
+chomp $lfname;
+
+# Verify that log output gets to this file, too
+
+$node->psql('postgres', 'fee fi fo fum');
+
+my $second_logfile;
+for (my $attempts = 0; $attempts < $max_attempts; $attempts++)
+{
+	$second_logfile = slurp_file($node->data_dir . '/' . $lfname);
+	last if $second_logfile =~ m/syntax error/;
+	usleep(100_000);
+}
+
+like(
+	$second_logfile,
+	qr/syntax error/,
+	'found expected log file content in new log file');
+
+$node->stop();
--
2.17.1

From 7b9d827e5059445945d214388371138f0676b306 Mon Sep 17 00:00:00 2001
From: Sehro
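For reference, the bit-twiddling behind approach #2 in the syslogger message above is small. A sketch follows; the LOG_DESTINATION_* values mirror PostgreSQL's existing constants, while the flag encoding itself is the proposal, not current behavior.

```python
# Sketch of approach #2: reuse the one-byte "is_last" slot as a bitmap,
# with the high bit meaning "last chunk" and the low 7 bits holding the
# log destination (room for three more beyond the current four).
LOG_DESTINATION_STDERR = 0x01
LOG_DESTINATION_SYSLOG = 0x02
LOG_DESTINATION_EVENTLOG = 0x04
LOG_DESTINATION_CSVLOG = 0x08
IS_LAST_FLAG = 0x80


def encode_flags(dest: int, is_last: bool) -> int:
    assert dest < IS_LAST_FLAG, "destination bit would collide with is_last"
    return dest | (IS_LAST_FLAG if is_last else 0)


def decode_flags(flags: int):
    return flags & ~IS_LAST_FLAG & 0xFF, bool(flags & IS_LAST_FLAG)


assert decode_flags(encode_flags(LOG_DESTINATION_CSVLOG, True)) == (LOG_DESTINATION_CSVLOG, True)
assert decode_flags(encode_flags(LOG_DESTINATION_STDERR, False)) == (LOG_DESTINATION_STDERR, False)
```

Both ends of the pipe then share one encode/decode pair instead of per-destination character constants, which is the cleanliness argument made for #2.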