Re: Password leakage avoidance

2024-01-06 Thread Sehrope Sarkuni
Scratch that: rather than memset(...), it should be explicit_bzero(...) so it
doesn't get optimized out. Same idea though.
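
For reference, a minimal sketch of that substitution, mirroring the memset()
example from the earlier message below (assuming explicit_bzero() or an
equivalent fallback is available on the platform):

explicit_bzero(pw1, strlen(pw1));
explicit_bzero(pw2, strlen(pw2));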

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/



Re: Password leakage avoidance

2024-01-06 Thread Sehrope Sarkuni
On Sat, Jan 6, 2024 at 12:39 PM Joe Conway  wrote:

> The only code specific comments were Tom's above, which have been
> addressed. If there are no serious objections I plan to commit this
> relatively soon.
>

One more thing that we do in pgjdbc is to zero out the input password args
so that they don't remain in memory even after being freed. It's kind of
odd in Java as it makes the input interface a char[] and we have to convert
them to garbage collected Strings internally (which kind of defeats the
purpose of the exercise).

But in libpq it could be done via something like:

memset(pw1, 0, strlen(pw1));
memset(pw2, 0, strlen(pw2));

There was some debate on our end of where to do that and we settled on
doing it inside the encoding functions to ensure it always happens. So the
input password char[] always gets wiped regardless of how the encoding
functions are invoked.

Even if it's not added to the password encoding functions (as that kind of
changes the after-effects if anything was relying on the variable still
containing the password), I think it'd be good to add it to the command.c stuff
that has the two copies of the password prior to freeing them.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: Password leakage avoidance

2024-01-06 Thread Sehrope Sarkuni
On Sat, Jan 6, 2024 at 11:53 AM Joe Conway  wrote:

> On 1/2/24 07:23, Sehrope Sarkuni wrote:
> > 1. There's two sets of defaults, the client program's default and the
> > server's default. Need to pick one for each implemented function. They
> > don't need to be the same across the board.
>
> Is there a concrete recommendation here?
>

Between writing that and now, we scrapped the "driver picks" variant of
this. Now we only have explicit functions for each of the encoding types
(i.e. one for md5 and one for SCRAM-SHA-256) and an alterUserPassword(...)
method that uses the database's default by reading the
password_encryption GUC. We also have some javadoc comments on the
encoding functions that strongly suggest using the SCRAM functions and only
using md5 directly for legacy servers.

The "driver picks" one was removed to prevent a situation where an end user
picks the driver default and it's not compatible with their server. The
rationale was if the driver's SCRAM-SHA-256 default is ever replaced with
something else (e.g. SCRAM-SHA-512) we'd end up with an interim state where
an upgraded driver application would try to use that newer encryption
method on an old DB. If a user is going to do that, they would have to be
explicit with their choice of encoding functions (hence removing the
"driver picks" variant).

So the recommendation is to have explicit functions for each variant and
have the end-to-end change password code read from the DB.

My understanding of this patch is that it does exactly that.
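
To make that concrete, here is a minimal sketch of the end-to-end flow using
only existing libpq calls; passing NULL as the algorithm to
PQencryptPasswordConn() makes it read the server's password_encryption
setting. This is only an illustration, not the API added by the patch, and
most error handling is trimmed:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

/* "conn" must be an open, idle connection. */
static bool
change_password(PGconn *conn, const char *user, const char *new_password)
{
    bool        ok = false;
    /* NULL algorithm => libpq asks the server for password_encryption */
    char       *encrypted = PQencryptPasswordConn(conn, new_password, user, NULL);

    if (encrypted != NULL)
    {
        char       *ident = PQescapeIdentifier(conn, user, strlen(user));
        char       *literal = PQescapeLiteral(conn, encrypted, strlen(encrypted));

        if (ident != NULL && literal != NULL)
        {
            char        sql[512];   /* illustrative fixed-size buffer */
            PGresult   *res;

            snprintf(sql, sizeof(sql), "ALTER USER %s PASSWORD %s", ident, literal);
            res = PQexec(conn, sql);
            ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
            PQclear(res);
        }
        PQfreemem(literal);
        PQfreemem(ident);
        PQfreemem(encrypted);
    }
    return ok;
}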


> > 2. Password encoding should be split out and made available as its own
> > functions. Not just as part of a wider "create _or_ alter a user's
> > password" function attached to a connection.
>
> It already is split out in libpq[1], unless I don't understand you
> correctly.
>

Sorry for the confusion. My original list wasn't meant as specific contrasts
with what libpq is doing. It was more a summary of thoughts, having just
concluded implementing the same type of password change stuff in PGJDBC.

From what I've seen in this patch, it either aligns with how we did things
in PGJDBC or it's something that isn't as relevant in this context (e.g.
generating the SQL text as a public function).

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: Password leakage avoidance

2024-01-02 Thread Sehrope Sarkuni
Having worked on and just about wrapped up the JDBC driver patch for this, a
couple of thoughts:

1. There's two sets of defaults, the client program's default and the
server's default. Need to pick one for each implemented function. They
don't need to be the same across the board.
2. Password encoding should be split out and made available as its own
functions. Not just as part of a wider "create _or_ alter a user's
password" function attached to a connection. We went a step further and
added an intermediate function that generates the "ALTER USER ... PASSWORD"
SQL.
3. We only added an "ALTER USER ... PASSWORD" function, not anything to
create a user. There's way too many options for that and keeping this
targeted at just assigning passwords makes it much easier to test.
4. RE:defaults, the PGJDBC approach is that the encoding-only function uses
the driver's default (SCRAM). The "alterUserPassword(...)" uses the
server's default (again usually SCRAM for modern installs but could be
something else). It's kind of odd that they're different but the use cases
are different as well.
5. Our SCRAM specific function allows for customizing the algorithm's
iteration and salt parameters. That topic came up on hackers previously[1].
Our high level "alterUserPassword(...)" function does not have those options,
but the customization is available as part of our PasswordUtil SCRAM API for
advanced users who want to leverage it. The higher level functions have
defaults for the iteration count (4096) and salt size (16 bytes); the
verifier layout is sketched after this list.
6. Getting the verbiage right for the PGJDBC version was kind of annoying
as we wanted to match the server's wording, e.g. "password_encryption", but
it's clearly hashing, not encryption. We settled on "password encoding" for
describing the overall task with the comments referencing the server's
usage of the term "password_encryption". Found a similar topic[2] on
changing that recently as well but looks like that's not going anywhere.
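
For context on item 5 above: a SCRAM-SHA-256 verifier (the string that ends
up in pg_authid.rolpassword) has the shape below, with the iteration count
and the salt as its first two variable pieces (placeholders stand in for
base64 values):

SCRAM-SHA-256$4096:<base64 salt>$<base64 StoredKey>:<base64 ServerKey>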

[1]:
https://www.postgresql.org/message-id/1d669d97-86b3-a5dc-9f02-c368bca911f6%40iki.fi
[2]:
https://www.postgresql.org/message-id/flat/ZV149Fd2JG_OF7CM%40momjian.us#cc97d20ff357a9e9264d4ae14e96e566

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Sehrope Sarkuni
On Wed, Dec 6, 2023 at 4:29 PM Joe Conway  wrote:

> > 1. Outputting a top level JSON object without the additional column
> > keys. IIUC, the top level keys are always the column names. A common use
> > case would be a single json/jsonb column that is already formatted
> > exactly as the user would like for output. Rather than enveloping it in
> > an object with a dedicated key, it would be nice to be able to output it
> > directly. This would allow non-object results to be outputted as well
> > (e.g., lines of JSON arrays, numbers, or strings). Due to how JSON is
> > structured, I think this would play nice with the JSON lines v.s. array
> > concept.
> >
> > COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM
> > generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON,
> > SOME_OPTION_TO_NOT_ENVELOPE)
> > {"foo":1}
> > {"foo":2}
> > {"foo":3}
>
> Your example does not match what you describe, or do I misunderstand? I
> thought your goal was to eliminate the repeated "foo" from each row...
>

The "foo" in this case is explicit as I'm adding it when building the
object. What I was trying to show was not adding an additional object
wrapper / envelope.

So each row is:

{"foo":1}

Rather than:

"{"json_build_object":{"foo":1}}

If each row has exactly one json / jsonb field, then the user has already
indicated the format for each row.

That same mechanism can be used to remove the "foo" entirely via a
json/jsonb array.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Sehrope Sarkuni
On Wed, Dec 6, 2023 at 4:03 PM Andrew Dunstan  wrote:

> > The output size difference does say that maybe we should pay some
> > attention to the nearby request to not always label every field.
> > Perhaps there should be an option for each row to transform to
> > a JSON array rather than an object?
>
> I doubt it. People who want this are likely to want pretty much what
> this patch is providing, not something they would have to transform in
> order to get it. If they want space-efficient data they won't really be
> wanting JSON. Maybe they want Protocol Buffers or something in that vein.
>

For arrays vs. objects, it's not just about data size. There are plenty of
situations where a JSON array is superior to an object (e.g. duplicate
column names). Lines of JSON arrays of strings are pretty much CSV with JSON
escaping rules and a pair of wrapping brackets. That format is common for
tabular data in node.js environments as you don't need a separate CSV parser.

Each one has its place and a default of the row_to_json(...) representation
of the row still makes sense. But if the user has the option of outputting
a single json/jsonb field for each row without an object or array wrapper,
then it's possible to support all of these use cases as the user can
explicitly pick whatever envelope makes sense:

-- Lines of JSON arrays:
COPY (SELECT json_build_array('test-' || a, b) FROM generate_series(1, 3)
a, generate_series(5,6) b) TO STDOUT WITH (FORMAT JSON,
SOME_OPTION_TO_DISABLE_ENVELOPE);
["test-1", 5]
["test-2", 5]
["test-3", 5]
["test-1", 6]
["test-2", 6]
["test-3", 6]

-- Lines of JSON strings:
COPY (SELECT to_json('test-' || x) FROM generate_series(1, 5) x) TO STDOUT
WITH (FORMAT JSON, SOME_OPTION_TO_DISABLE_ENVELOPE);
"test-1"
"test-2"
"test-3"
"test-4"
"test-5"

I'm not sure how I feel about the behavior being automatic if it's a single
top level json / jsonb field rather than requiring the explicit option.
It's probably what a user would want, but it also feels odd to change the
output wrapper automatically based on the fields in the response. If it is
automatic and the user wants the additional envelope, the option always
exists to wrap it further in another: json_build_object('some_field',
my_field_i_want_wrapped)

The duplicate field names would be a good test case too. I haven't gone
through this patch but I'm guessing it doesn't filter out duplicates so the
behavior would match up with row_to_json(...), i.e. duplicates are
preserved:

=> SELECT row_to_json(t.*) FROM (SELECT 1 AS a, 2 AS a) t;
  row_to_json
---------------
 {"a":1,"a":2}

If so, that's a good test case to add as however that's handled should be
deterministic.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Sehrope Sarkuni
Big +1 to this overall feature.

This is something I've wanted for a long time as well. While it's possible
to use a COPY with text output for a trivial case, the double escaping
falls apart quickly for arbitrary data. It's really only usable when you
know exactly what you are querying and know it will not be a problem.

Regarding the defaults for the output, I think JSON lines (rather than a
JSON array of objects) would be preferred. It's more natural to combine
them and generate that type of data on the fly rather than forcing
aggregation into a single object.

A couple more features / use cases come to mind as well. Even if they're not
part of a first round of this feature, I think it'd be helpful to document
them now as it might give some ideas for what does make that first cut:

1. Outputting a top level JSON object without the additional column keys.
IIUC, the top level keys are always the column names. A common use case
would be a single json/jsonb column that is already formatted exactly as
the user would like for output. Rather than enveloping it in an object with
a dedicated key, it would be nice to be able to output it directly. This
would allow non-object results to be outputted as well (e.g., lines of JSON
arrays, numbers, or strings). Due to how JSON is structured, I think this
would play nice with the JSON lines v.s. array concept.

COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM
generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON,
SOME_OPTION_TO_NOT_ENVELOPE)
{"foo":1}
{"foo":2}
{"foo":3}

2. An option to ignore null fields so they are excluded from the output.
This would not be a default but would allow shrinking the total size of the
output data in many situations. This would be recursive to allow nested
objects to be shrunk down (not just the top level). This might be
worthwhile as a standalone JSON function though handling it during output
would be more efficient as it'd only be read once.

COPY (SELECT json_build_object('foo', CASE WHEN x > 1 THEN x END) FROM
generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON,
SOME_OPTION_TO_NOT_ENVELOPE, JSON_SKIP_NULLS)
{}
{"foo":2}
{"foo":3}

3. Reverse of #2 when copying data in to allow defaulting missing fields to
NULL.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: PG 16 draft release notes ready

2023-05-19 Thread Sehrope Sarkuni
The intro in "E.1.3. Changes" says "... between PostgreSQL 15 and the
previous major release".

That should be "... between PostgreSQL __16__ ..." right?

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: Proposal to provide the facility to set binary format output for specific OID's per session

2022-07-25 Thread Sehrope Sarkuni
Idea here makes sense and I've seen this brought up repeatedly on the JDBC
lists.

Does the driver need to be aware that this SET command was executed? I'm
wondering what happens if an end user executes this with an OID the driver
does not actually know how to handle.

> + Oid *tmpOids = palloc(length+1);
> ...
> + tmpOids = repalloc(tmpOids, length+1);

These should be: sizeof(Oid) * (length + 1)

Also, I think you need to specify an explicit context via
MemoryContextAlloc or the allocated memory will be in the default context
and released at the end of the command.
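
A hedged sketch of the corrected allocation, keeping the variable name from
the quoted patch (TopMemoryContext here is only illustrative; whichever
context should own the list for the life of the session would do):

Oid *tmpOids = MemoryContextAlloc(TopMemoryContext, sizeof(Oid) * (length + 1));
...
tmpOids = repalloc(tmpOids, sizeof(Oid) * (length + 1));

Note that repalloc() keeps the chunk in the context it was originally
allocated in, so only the first call needs the explicit context.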

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: Add jsonlog log_destination for JSON server logs

2021-09-17 Thread Sehrope Sarkuni
On Thu, Sep 16, 2021 at 9:36 PM Michael Paquier  wrote:

> I am not really impressed by 0001 and the introduction of
> LOG_DESTINATIONS_WITH_FILES.  So I would leave that out and keep the
> list of destinations listed instead.  And we are talking about two
> places here, only within syslogger.c.
>

I've taken that out for now. The idea was to simplify the additions when
jsonlog is added, but I can circle back to that later if it makes sense.


> 0002 looks like an improvement,


Nice. That's left unchanged (renamed to 0001 now).


> but I think that 0003 makes the
> readability of the code worse with the introduction of eight static
> routines to handle all those cases.
>
> file_log_dest_should_rotate_for_size(), file_log_dest_close(),
> file_log_dest_check_rotate_for_size(), get_syslogger_file() don't
> bring major improvements.  On the contrary,
> file_log_dest_write_metadata and file_log_dest_rotate seem to reduce a
> bit the noise.
>

It was helpful to split them out while working on the patch but I see your
point upon re-reading through the result.

The attached version (renamed to 002) adds only three static functions: one
for checking rotation size, one for performing the actual rotation, and one
for closing. There is also one more to handle generating the logfile names
with a suffix and the associated pfree-ing (wrapping logfile_open(...)).

The rest of the changes happen in place using the new structures.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
From c85e15ffc6a8e310c10be4c85580980d04846bf2 Mon Sep 17 00:00:00 2001
From: Sehrope Sarkuni 
Date: Thu, 16 Sep 2021 15:57:00 -0400
Subject: [PATCH 1/2] Split out syslogger EXEC_BACKEND fd serialization and
 opening into helper functions

---
 src/backend/postmaster/syslogger.c | 109 -
 1 file changed, 44 insertions(+), 65 deletions(-)

diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index bca3883572..48b4c48a18 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -730,9 +730,23 @@ SysLogger_Start(void)
 	return 0;
 }
 
-
 #ifdef EXEC_BACKEND
 
+static long syslogger_get_fileno(FILE *file)
+{
+#ifndef WIN32
+	if (file != NULL)
+		return (long) fileno(file);
+	else
+		return -1;
+#else			/* WIN32 */
+	if (file != NULL)
+		return (long) _get_osfhandle(_fileno(file));
+	else
+		return 0;
+#endif			/* WIN32 */
+}
+
 /*
  * syslogger_forkexec() -
  *
@@ -751,34 +765,9 @@ syslogger_forkexec(void)
 	av[ac++] = NULL;			/* filled in by postmaster_forkexec */
 
 	/* static variables (those not passed by write_backend_variables) */
-#ifndef WIN32
-	if (syslogFile != NULL)
-		snprintf(filenobuf, sizeof(filenobuf), "%d",
- fileno(syslogFile));
-	else
-		strcpy(filenobuf, "-1");
-#else			/* WIN32 */
-	if (syslogFile != NULL)
-		snprintf(filenobuf, sizeof(filenobuf), "%ld",
- (long) _get_osfhandle(_fileno(syslogFile)));
-	else
-		strcpy(filenobuf, "0");
-#endif			/* WIN32 */
+	snprintf(filenobuf, sizeof(filenobuf), "%ld", syslogger_get_fileno(syslogFile));
 	av[ac++] = filenobuf;
-
-#ifndef WIN32
-	if (csvlogFile != NULL)
-		snprintf(csvfilenobuf, sizeof(csvfilenobuf), "%d",
- fileno(csvlogFile));
-	else
-		strcpy(csvfilenobuf, "-1");
-#else			/* WIN32 */
-	if (csvlogFile != NULL)
-		snprintf(csvfilenobuf, sizeof(csvfilenobuf), "%ld",
- (long) _get_osfhandle(_fileno(csvlogFile)));
-	else
-		strcpy(csvfilenobuf, "0");
-#endif			/* WIN32 */
+	snprintf(csvfilenobuf, sizeof(csvfilenobuf), "%ld", syslogger_get_fileno(csvlogFile));
 	av[ac++] = csvfilenobuf;
 
 	av[ac] = NULL;
@@ -787,6 +776,31 @@ syslogger_forkexec(void)
 	return postmaster_forkexec(ac, av);
 }
 
+static FILE* syslogger_fdopen(int fd)
+{
+	FILE *file = NULL;
+
+#ifndef WIN32
+	if (fd != -1)
+	{
+		file = fdopen(fd, "a");
+		setvbuf(file, NULL, PG_IOLBF, 0);
+	}
+#else			/* WIN32 */
+	if (fd != 0)
+	{
+		fd = _open_osfhandle(fd, _O_APPEND | _O_TEXT);
+		if (fd > 0)
+		{
+			file = fdopen(fd, "a");
+			setvbuf(file, NULL, PG_IOLBF, 0);
+		}
+	}
+#endif			/* WIN32 */
+
+	return file;
+}
+
 /*
  * syslogger_parseArgs() -
  *
@@ -795,8 +809,6 @@ syslogger_forkexec(void)
 static void
 syslogger_parseArgs(int argc, char *argv[])
 {
-	int			fd;
-
 	Assert(argc == 5);
 	argv += 3;
 
@@ -807,41 +819,8 @@ syslogger_parseArgs(int argc, char *argv[])
 	 * fails there's not a lot we can do to report the problem anyway.  As
 	 * coded, we'll just crash on a null pointer dereference after failure...
 	 */
-#ifndef WIN32
-	fd = atoi(*argv++);
-	if (fd != -1)
-	{
-		syslogFile = fdopen(fd, "a");
-		setvbuf(syslogFile, NULL, PG_IOLBF, 0);
-	}
-	fd = atoi(*argv++);
-	if (fd != -1)
-	{
-		csvlogFile = fdopen(fd, "a");
-		setvbuf(csvlogFile, NULL, PG_IOLBF, 0);
-	}
-#else			/* 

Re: Add jsonlog log_destination for JSON server logs

2021-09-16 Thread Sehrope Sarkuni
The previous patches failed on the cfbot Appveyor Windows build. That also
makes me question whether I did the EXEC_BACKEND testing correctly as it
should have caught that compile error. I'm going to look into that.

In the meantime the attached should correct that part of the build.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/

From 0f7978566b7770465060cd497a8dd4102b313878 Mon Sep 17 00:00:00 2001
From: Sehrope Sarkuni 
Date: Thu, 16 Sep 2021 14:43:31 -0400
Subject: [PATCH 1/3] Add constant for list of log destinations that use files

---
 src/backend/postmaster/syslogger.c | 5 ++---
 src/include/utils/elog.h   | 3 +++
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index bca3883572..bc546af7ff 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -420,7 +420,7 @@ SysLoggerMain(int argc, char *argv[])
 			 * was sent by pg_rotate_logfile() or "pg_ctl logrotate".
 			 */
 			if (!time_based_rotation && size_rotation_for == 0)
-size_rotation_for = LOG_DESTINATION_STDERR | LOG_DESTINATION_CSVLOG;
+size_rotation_for = LOG_DESTINATIONS_WITH_FILES;
 			logfile_rotate(time_based_rotation, size_rotation_for);
 		}
 
@@ -1465,8 +1465,7 @@ update_metainfo_datafile(void)
 	FILE	   *fh;
 	mode_t		oumask;
 
-	if (!(Log_destination & LOG_DESTINATION_STDERR) &&
-		!(Log_destination & LOG_DESTINATION_CSVLOG))
+	if (!(Log_destination & LOG_DESTINATIONS_WITH_FILES))
 	{
 		if (unlink(LOG_METAINFO_DATAFILE) < 0 && errno != ENOENT)
 			ereport(LOG,
diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h
index f53607e12e..bea8b93da6 100644
--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -437,6 +437,9 @@ extern bool syslog_split_messages;
 #define LOG_DESTINATION_EVENTLOG 4
 #define LOG_DESTINATION_CSVLOG	 8
 
+/* Log destinations with file handles */
+#define LOG_DESTINATIONS_WITH_FILES (LOG_DESTINATION_CSVLOG | LOG_DESTINATION_STDERR)
+
 /* Other exported functions */
 extern void DebugFileOpen(void);
 extern char *unpack_sql_state(int sql_state);
-- 
2.25.1

From a22760648dad3832869b8c67886826e7cd306e52 Mon Sep 17 00:00:00 2001
From: Sehrope Sarkuni 
Date: Thu, 16 Sep 2021 15:57:00 -0400
Subject: [PATCH 2/3] Split out syslogger EXEC_BACKEND fd serialization and
 opening into helper functions

---
 src/backend/postmaster/syslogger.c | 109 -
 1 file changed, 44 insertions(+), 65 deletions(-)

diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index bc546af7ff..4f0477794e 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -730,9 +730,23 @@ SysLogger_Start(void)
 	return 0;
 }
 
-
 #ifdef EXEC_BACKEND
 
+static long syslogger_get_fileno(FILE *file)
+{
+#ifndef WIN32
+	if (file != NULL)
+		return (long) fileno(file);
+	else
+		return -1;
+#else			/* WIN32 */
+	if (file != NULL)
+		return (long) _get_osfhandle(_fileno(file));
+	else
+		return 0;
+#endif			/* WIN32 */
+}
+
 /*
  * syslogger_forkexec() -
  *
@@ -751,34 +765,9 @@ syslogger_forkexec(void)
 	av[ac++] = NULL;			/* filled in by postmaster_forkexec */
 
 	/* static variables (those not passed by write_backend_variables) */
-#ifndef WIN32
-	if (syslogFile != NULL)
-		snprintf(filenobuf, sizeof(filenobuf), "%d",
- fileno(syslogFile));
-	else
-		strcpy(filenobuf, "-1");
-#else			/* WIN32 */
-	if (syslogFile != NULL)
-		snprintf(filenobuf, sizeof(filenobuf), "%ld",
- (long) _get_osfhandle(_fileno(syslogFile)));
-	else
-		strcpy(filenobuf, "0");
-#endif			/* WIN32 */
+	snprintf(filenobuf, sizeof(filenobuf), "%ld", syslogger_get_fileno(syslogFile));
 	av[ac++] = filenobuf;
-
-#ifndef WIN32
-	if (csvlogFile != NULL)
-		snprintf(csvfilenobuf, sizeof(csvfilenobuf), "%d",
- fileno(csvlogFile));
-	else
-		strcpy(csvfilenobuf, "-1");
-#else			/* WIN32 */
-	if (csvlogFile != NULL)
-		snprintf(csvfilenobuf, sizeof(csvfilenobuf), "%ld",
- (long) _get_osfhandle(_fileno(csvlogFile)));
-	else
-		strcpy(csvfilenobuf, "0");
-#endif			/* WIN32 */
+	snprintf(csvfilenobuf, sizeof(csvfilenobuf), "%ld", syslogger_get_fileno(csvlogFile));
 	av[ac++] = csvfilenobuf;
 
 	av[ac] = NULL;
@@ -787,6 +776,31 @@ syslogger_forkexec(void)
 	return postmaster_forkexec(ac, av);
 }
 
+static FILE* syslogger_fdopen(int fd)
+{
+	FILE *file = NULL;
+
+#ifndef WIN32
+	if (fd != -1)
+	{
+		file = fdopen(fd, "a");
+		setvbuf(file, NULL, PG_IOLBF, 0);
+	}
+#else			/* WIN32 */
+	if (fd != 0)
+	{
+		fd = _open_osfhandle(fd, _O_APPEND | _O_TEXT);
+		if (fd > 0)
+		{
+			file = fdopen(fd, "a");
+			setvbuf(file, NULL, PG_IOLBF, 0);
+		}
+	}
+#endif			/* WIN32 */
+
+	return file;
+}
+
 /*

Re: Add jsonlog log_destination for JSON server logs

2021-09-16 Thread Sehrope Sarkuni
Attached three patches refactor the syslogger handling of file based
destinations a bit ... and then a lot.

First patch adds a new constant to represent both file based destinations.
This should make it easier to ensure additional destinations are handled in
"For all file destinations..." situations (e.g. when we add the jsonlog
destination).

Second patch refactors the file descriptor serialization and re-opening in
EXEC_BACKEND forking. Previously the code was duplicated for both stderr
and csvlog. Again, this should simplify adding new destinations as they'd
just invoke the same helper. There's an existing comment about not handling
failed opens in syslogger_parseArgs(...) and this patch doesn't fix that,
but it does provide a single location to do so in the future.

The third patch adds a new internal (to syslogger.c) structure,
FileLogDestination, for file based log destinations and modifies the
existing handling for syslogFile and csvlogFile to defer to a bunch of
helper functions. It also renames "syslogFile" to "stderr_file_info".
Working through this patch, it was initially confusing that the stderr log
file was named "syslogFile", the C file is named syslogger.c, and there's
an entirely separate syslog (the message logging API) destination. I think
this clears that up a bit.
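
The struct itself isn't spelled out in this message, but the shape being
described is roughly the following (field names here are illustrative, not
necessarily the patch's):

typedef struct FileLogDestination
{
	int         log_destination;    /* LOG_DESTINATION_* bit for this file */
	const char *filename_suffix;    /* e.g. "" for stderr, ".csv" for csvlog */
	FILE       *file;               /* currently open log file, or NULL */
	char       *last_file_name;     /* current file name, kept for rotation */
} FileLogDestination;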

The patches pass check-world on Linux. I haven't tested it on Windows but
it does pass check-world with EXEC_BACKEND defined to try out the forking
code paths. Definitely needs some review to ensure it's functionally
correct for the different log files.

I haven't tried splitting the csvlog code out or adding the new jsonlog
atop this yet. There's enough changes here that I'd like to get this
settled first.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
From bd5a4ff0435c721de3e7eb9b9207d9e96d79baf4 Mon Sep 17 00:00:00 2001
From: Sehrope Sarkuni 
Date: Thu, 16 Sep 2021 14:43:31 -0400
Subject: [PATCH 1/3] Add constant for list of log destinations that use files

---
 src/backend/postmaster/syslogger.c | 5 ++---
 src/include/utils/elog.h   | 3 +++
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index bca3883572..bc546af7ff 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -420,7 +420,7 @@ SysLoggerMain(int argc, char *argv[])
 			 * was sent by pg_rotate_logfile() or "pg_ctl logrotate".
 			 */
 			if (!time_based_rotation && size_rotation_for == 0)
-size_rotation_for = LOG_DESTINATION_STDERR | LOG_DESTINATION_CSVLOG;
+size_rotation_for = LOG_DESTINATIONS_WITH_FILES;
 			logfile_rotate(time_based_rotation, size_rotation_for);
 		}
 
@@ -1465,8 +1465,7 @@ update_metainfo_datafile(void)
 	FILE	   *fh;
 	mode_t		oumask;
 
-	if (!(Log_destination & LOG_DESTINATION_STDERR) &&
-		!(Log_destination & LOG_DESTINATION_CSVLOG))
+	if (!(Log_destination & LOG_DESTINATIONS_WITH_FILES))
 	{
 		if (unlink(LOG_METAINFO_DATAFILE) < 0 && errno != ENOENT)
 			ereport(LOG,
diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h
index f53607e12e..bea8b93da6 100644
--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -437,6 +437,9 @@ extern bool syslog_split_messages;
 #define LOG_DESTINATION_EVENTLOG 4
 #define LOG_DESTINATION_CSVLOG	 8
 
+/* Log destinations with file handles */
+#define LOG_DESTINATIONS_WITH_FILES (LOG_DESTINATION_CSVLOG | LOG_DESTINATION_STDERR)
+
 /* Other exported functions */
 extern void DebugFileOpen(void);
 extern char *unpack_sql_state(int sql_state);
-- 
2.17.1

From 653ea6865ce81de89bec4fc41f3c0be2933e6b2b Mon Sep 17 00:00:00 2001
From: Sehrope Sarkuni 
Date: Thu, 16 Sep 2021 15:57:00 -0400
Subject: [PATCH 2/3] Split out syslogger EXEC_BACKEND fd serialization and
 opening into helper functions

---
 src/backend/postmaster/syslogger.c | 109 -
 1 file changed, 44 insertions(+), 65 deletions(-)

diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index bc546af7ff..4f0477794e 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -730,9 +730,23 @@ SysLogger_Start(void)
 	return 0;
 }
 
-
 #ifdef EXEC_BACKEND
 
+static long syslogger_get_fileno(FILE *file)
+{
+#ifndef WIN32
+	if (file != NULL)
+		return (long) fileno(file);
+	else
+		return -1;
+#else			/* WIN32 */
+	if (file != NULL)
+		return (long) _get_osfhandle(_fileno(file));
+	else
+		return 0;
+#endif			/* WIN32 */
+}
+
 /*
  * syslogger_forkexec() -
  *
@@ -751,34 +765,9 @@ syslogger_forkexec(void)
 	av[ac++] = NULL;			/* filled in by postmaster_forkexec */
 
 	/* static variables (those not passed by write_backend_variables) */
-#ifndef WIN32
-	if (syslogFile != NULL)
-		snprintf(filenobuf, sizeof(filenobuf), "%d",
- fileno(syslogFile));
-	else
-		st

Re: Add jsonlog log_destination for JSON server logs

2021-09-13 Thread Sehrope Sarkuni
On Sun, Sep 12, 2021 at 10:22 PM Michael Paquier 
wrote:

> On Fri, Sep 10, 2021 at 03:56:18PM +0900, Michael Paquier wrote:
> > And this part leads me to the attached, where the addition of the JSON
> > format would result in the addition of a couple of lines.
>
> Okay, I have worked through the first half of the patch set, and
> applied the improved versions of 0001 (refactoring of the chunk
> protocol) and 0002 (addition of the tests for csvlog).
>

Thanks. I finally got a chance to look through those changes. I like it.
The popcount and pulling out the flags are much easier to follow than the
old hard coded letters.


> I have not looked in details at 0003 and 0004 yet.  Still, here are
> some comments after a quick scan.
>
> + * elog-internal.h
> I'd rather avoid the hyphen, and use elog_internal.h.
>
> +#define FORMATTED_TS_LEN 128
> +extern char formatted_start_time[FORMATTED_TS_LEN];
> +extern char formatted_log_time[FORMATTED_TS_LEN];
> +
> +void setup_formatted_log_time(void);
> +void setup_formatted_start_time(void);
> We could just use a static buffer in each one of those routines, and
> return back a pointer to the caller.
>

+1
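
As a simplified illustration of that suggestion, each routine could own its
buffer and hand back a pointer; the function name and timestamp format below
are illustrative only, not the actual elog.c code:

#include <time.h>

#define FORMATTED_TS_LEN 128

static const char *
get_formatted_log_time(void)
{
	static char formatted_log_time[FORMATTED_TS_LEN];
	time_t      now = time(NULL);

	strftime(formatted_log_time, sizeof(formatted_log_time),
			 "%Y-%m-%d %H:%M:%S %Z", localtime(&now));
	return formatted_log_time;
}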


> +   else if ((Log_destination & LOG_DESTINATION_JSONLOG) &&
> +   (jsonlogFile == NULL ||
> +time_based_rotation || (size_rotation_for &
> LOG_DESTINATION_JSONLOG)))
> [...]
> +   /* Write to JSON log if enabled */
> +   else if (Log_destination & LOG_DESTINATION_JSONLOG)
> +   {
> Those bits in 0004 are wrong.  They should be two "if" clauses.  This
> is leading to issues when setting log_destination to multiple values
> with jsonlog in the set of values and logging_connector = on, and the
> logs are not getting redirected properly to the .json file.  We would
> get the data for the .log and .csv files though.  This issue also
> happens with the original patch set applied on top of e757080.   I
> think that we should also be more careful with the way we free
> StringInfoData in send_message_to_server_log(), or we are going to
> finish with unwelcome leaks.
>

Good catch. Staring at that piece again, that's tricky as it tries to
aggressively free the buffer before calling write_csvlog(...), which can't
just be duplicated for additional destinations.

I think we need to pull up the negative case (i.e. syslogger not available)
before the other destinations and if it matches, free and exit early.
Otherwise, free the buffer and call whatever destination routines are
enabled.
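
To make the reordering concrete, a rough control-flow sketch (the helpers
marked "hypothetical" are placeholders, not actual elog.c routines;
write_csvlog()/write_jsonlog() are the per-format routines discussed in this
thread):

/* Negative case first: syslogger not reachable, dump and bail out. */
if (!syslogger_usable())                         /* hypothetical check */
{
	write_stderr_directly(buf.data, buf.len);    /* hypothetical helper */
	pfree(buf.data);
	return;                                      /* free and exit early */
}

/* Hand off the pre-formatted text, then free the buffer once... */
if (Log_destination & LOG_DESTINATION_STDERR)
	send_chunks_to_syslogger(buf.data, buf.len); /* hypothetical helper */
pfree(buf.data);

/* ...while csvlog/jsonlog re-format from the ErrorData, so plain,
 * independent "if"s work for each enabled destination. */
if (Log_destination & LOG_DESTINATION_CSVLOG)
	write_csvlog(edata);
if (Log_destination & LOG_DESTINATION_JSONLOG)
	write_jsonlog(edata);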


> The same coding pattern is now repeated three times in logfile_rotate():
> - Check if a logging type is enabled.
> - Optionally open new file, with logfile_open().
> - Business with ENFILE and EMFILE.
> - pfree() and save of the static FILE* ane file name for each type.
> I think that we have enough material for a wrapper routine that does
> this work, where we pass down LOG_DESTINATION_* and pointers to the
> static FILE* and the static last file names.  That would have avoided
> the bug I found above.
>

I started on a bit of this as well. There's so much overlap already between
the syslog_ and csvlog code that I'm going to put that together first. Then
the json addition should just fall into a couple of call sites.

I'm thinking of adding an internal struct to house the FILE* and file
names. Then all the opening, closing, and log rotation code can pass
pointers to those and centralize the pfree() and NULL checks.


> The rebased patch set, that has reworked the tests to be in line with
> HEAD, also fails.  They compile.
>
> Sehrope, could you adjust that?


Yes I'm looking to sync those up and address the above later this week.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: Add jsonlog log_destination for JSON server logs

2021-09-01 Thread Sehrope Sarkuni
Updated patch set is attached.

This version splits out the existing csvlog code into its own file and
centralizes the common helpers into a new elog-internal.h so that they're
only included by the actual write_xyz sources.

That makes the elog.c changes in the JSON logging patch minimal as all it's
really doing is invoking the new write_jsonlog(...) function.

It also adds those missing fields to the JSON logger output.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
From d5b3f5fe44e91d35aefdd570758d5b2a9e9c1a36 Mon Sep 17 00:00:00 2001
From: Sehrope Sarkuni 
Date: Wed, 10 Jul 2019 10:02:31 -0400
Subject: [PATCH 1/4] Adds separate dest field to log protocol PipeProtoHeader

Adds a separate dest field to PipeProtoHeader to store the log destination
requested by the sending process. Also changes the is_last field to only
store whether the chunk is the last one for a message rather than also
including whether the destination is csvlog.
---
 src/backend/postmaster/syslogger.c | 15 ++-
 src/backend/utils/error/elog.c |  4 +++-
 src/include/postmaster/syslogger.h |  4 ++--
 3 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index cad43bdef2..edd8f33204 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -878,7 +878,6 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer)
 {
 	char	   *cursor = logbuffer;
 	int			count = *bytes_in_logbuffer;
-	int			dest = LOG_DESTINATION_STDERR;
 
 	/* While we have enough for a header, process data... */
 	while (count >= (int) (offsetof(PipeProtoHeader, data) + 1))
@@ -891,8 +890,9 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer)
 		if (p.nuls[0] == '\0' && p.nuls[1] == '\0' &&
 			p.len > 0 && p.len <= PIPE_MAX_PAYLOAD &&
 			p.pid != 0 &&
-			(p.is_last == 't' || p.is_last == 'f' ||
-			 p.is_last == 'T' || p.is_last == 'F'))
+			(p.is_last == 't' || p.is_last == 'f') &&
+			(p.dest == LOG_DESTINATION_CSVLOG ||
+			 p.dest == LOG_DESTINATION_STDERR))
 		{
 			List	   *buffer_list;
 			ListCell   *cell;
@@ -906,9 +906,6 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer)
 			if (count < chunklen)
 break;
 
-			dest = (p.is_last == 'T' || p.is_last == 'F') ?
-LOG_DESTINATION_CSVLOG : LOG_DESTINATION_STDERR;
-
 			/* Locate any existing buffer for this source pid */
 			buffer_list = buffer_lists[p.pid % NBUFFER_LISTS];
 			foreach(cell, buffer_list)
@@ -924,7 +921,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer)
 	free_slot = buf;
 			}
 
-			if (p.is_last == 'f' || p.is_last == 'F')
+			if (p.is_last == 'f')
 			{
 /*
  * Save a complete non-final chunk in a per-pid buffer
@@ -970,7 +967,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer)
 	appendBinaryStringInfo(str,
 		   cursor + PIPE_HEADER_SIZE,
 		   p.len);
-	write_syslogger_file(str->data, str->len, dest);
+	write_syslogger_file(str->data, str->len, p.dest);
 	/* Mark the buffer unused, and reclaim string storage */
 	existing_slot->pid = 0;
 	pfree(str->data);
@@ -979,7 +976,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer)
 {
 	/* The whole message was one chunk, evidently. */
 	write_syslogger_file(cursor + PIPE_HEADER_SIZE, p.len,
-		 dest);
+		 p.dest);
 }
 			}
 
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index a3e1c59a82..cd13111708 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -3250,6 +3250,8 @@ write_pipe_chunks(char *data, int len, int dest)
 
 	p.proto.nuls[0] = p.proto.nuls[1] = '\0';
 	p.proto.pid = MyProcPid;
+	p.proto.dest = (int32) dest;
+	p.proto.is_last = 'f';
 
 	/* write all but the last chunk */
 	while (len > PIPE_MAX_PAYLOAD)
@@ -3264,7 +3266,7 @@ write_pipe_chunks(char *data, int len, int dest)
 	}
 
 	/* write the last chunk */
-	p.proto.is_last = (dest == LOG_DESTINATION_CSVLOG ? 'T' : 't');
+	p.proto.is_last = 't';
 	p.proto.len = len;
 	memcpy(p.proto.data, data, len);
	rc = write(fd, &p, PIPE_HEADER_SIZE + len);
diff --git a/src/include/postmaster/syslogger.h b/src/include/postmaster/syslogger.h
index 1491eecb0f..41d026a474 100644
--- a/src/include/postmaster/syslogger.h
+++ b/src/include/postmaster/syslogger.h
@@ -46,8 +46,8 @@ typedef struct
 	char		nuls[2];		/* always \0\0 */
 	uint16		len;			/* size of this chunk (counts data only) */
 	int32		pid;			/* writer's pid */
-	char		is_last;		/* last chunk of message? 't' or 'f' ('T' or
- * 'F' for CSV case) */
+	int32		dest;			/* log destination */
+	char		is_last;/* last chunk of message? 't' or 'f'*/
 	char		data[FLEXIBLE_ARRAY_MEMBER];	/* data payload starts here */
 } PipeProtoHeader;
 
-- 
2.17.1

From dfb17c0b1804b

Re: Add jsonlog log_destination for JSON server logs

2021-09-01 Thread Sehrope Sarkuni
On Tue, Aug 31, 2021 at 8:43 PM Michael Paquier  wrote:

> On Tue, Aug 31, 2021 at 11:34:56AM -0400, Sehrope Sarkuni wrote:
> > The second commit adds a TAP test for log_destination "csvlog". This was
> > done to both confirm that the previous change didn't break anything and
> as
> > a skeleton for the test in the next commit.
>
> +note "Before sleep";
> +usleep(100_000);
> +note "Before rotate";
> +$node->logrotate();
> +note "After rotate";
> +usleep(100_000);
>
> Do you really need a rotation of the log files here?  Wouldn't it be
> better to grab the position of the current log file with a fixed log
> file name, and then slurp the file from this position with your
> expected output?  That would make the test faster, as well.
>

Yes, that was intentional. I used the log rotation tap test as a base and
kept that piece in there so it verifies that the csv log files are actually
rotated. Same for the json logs.


> Rather than making elog.c larger, I think that we should try to split
> that into more files.  Why not refactoring out the CSV part first?
> You could just call that csvlog.c, then create a new jsonlog.c for the
> meat of the patch.
>

That's a good idea. I'll try that out.

> The list of fields is not up to date.  At quick glance, you are
> missing:
> - backend type.
> - leader PID.
> - query ID.
> - Session start timestamp (?)
>

Thanks. I'll take a look.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Add jsonlog log_destination for JSON server logs

2021-08-31 Thread Sehrope Sarkuni
Hi,

This patch adds a new log_destination, "jsonlog", that writes log entries
as lines of JSON. It was originally started by David Fetter using
the jsonlog module by Michael Paquier (
https://github.com/michaelpq/pg_plugins/blob/master/jsonlog/jsonlog.c) as a
basis for how to serialize the log messages. Thanks to both of them because
this wouldn't be possible without that starting point.

The first commit splits out the destination in log pipe messages into its
own field. Previously it would piggyback on the "is_last" field. This adds
an int to the message size but makes the rest of the code easier to follow.

The second commit adds a TAP test for log_destination "csvlog". This was
done to both confirm that the previous change didn't break anything and as
a skeleton for the test in the next commit.

The third commit adds the new log_destination "jsonlog". The output format
is one line per entry with the top level output being a JSON object keyed
with the log fields. Newlines in the output fields are escaped as \n so the
output file has exactly one line per log entry. It also includes a new test
for verifying the JSON output with some basic regex checks (similar to the
csvlog test).

Here's a sample of what the log entries look like:

{"timestamp":"2021-08-31 10:15:25.129
EDT","user":"sehrope","dbname":"postgres","pid":12012,"remote_host":"[local]","session_id":"612e397d.2eec","line_num":1,"ps":"idle","session_start":"2021-08-31
10:15:25
EDT","vxid":"3/2","txid":"0","error_severity":"LOG","application_name":"
006_jsonlog.pl","message":"statement: SELECT 1/0"}

It builds and passes "make check-world" on Linux. It also includes code to
handle Windows as well but I have not actually tried building it there.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
From d5b3f5fe44e91d35aefdd570758d5b2a9e9c1a36 Mon Sep 17 00:00:00 2001
From: Sehrope Sarkuni 
Date: Wed, 10 Jul 2019 10:02:31 -0400
Subject: [PATCH 1/3] Adds separate dest field to log protocol PipeProtoHeader

Adds a separate dest field to PipeProtoHeader to store the log destination
requested by the sending process. Also changes the is_last field to only
store whether the chunk is the last one for a message rather than also
including whether the destination is csvlog.
---
 src/backend/postmaster/syslogger.c | 15 ++-
 src/backend/utils/error/elog.c |  4 +++-
 src/include/postmaster/syslogger.h |  4 ++--
 3 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index cad43bdef2..edd8f33204 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -878,7 +878,6 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer)
 {
 	char	   *cursor = logbuffer;
 	int			count = *bytes_in_logbuffer;
-	int			dest = LOG_DESTINATION_STDERR;
 
 	/* While we have enough for a header, process data... */
 	while (count >= (int) (offsetof(PipeProtoHeader, data) + 1))
@@ -891,8 +890,9 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer)
 		if (p.nuls[0] == '\0' && p.nuls[1] == '\0' &&
 			p.len > 0 && p.len <= PIPE_MAX_PAYLOAD &&
 			p.pid != 0 &&
-			(p.is_last == 't' || p.is_last == 'f' ||
-			 p.is_last == 'T' || p.is_last == 'F'))
+			(p.is_last == 't' || p.is_last == 'f') &&
+			(p.dest == LOG_DESTINATION_CSVLOG ||
+			 p.dest == LOG_DESTINATION_STDERR))
 		{
 			List	   *buffer_list;
 			ListCell   *cell;
@@ -906,9 +906,6 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer)
 			if (count < chunklen)
 break;
 
-			dest = (p.is_last == 'T' || p.is_last == 'F') ?
-LOG_DESTINATION_CSVLOG : LOG_DESTINATION_STDERR;
-
 			/* Locate any existing buffer for this source pid */
 			buffer_list = buffer_lists[p.pid % NBUFFER_LISTS];
 			foreach(cell, buffer_list)
@@ -924,7 +921,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer)
 	free_slot = buf;
 			}
 
-			if (p.is_last == 'f' || p.is_last == 'F')
+			if (p.is_last == 'f')
 			{
 /*
  * Save a complete non-final chunk in a per-pid buffer
@@ -970,7 +967,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer)
 	appendBinaryStringInfo(str,
 		   cursor + PIPE_HEADER_SIZE,
 		   p.len);
-	write_syslogger_file(str->data, str->len, dest);
+	write_syslogger_file(str->data, str->len, p.dest);
 	/* Mark the buffer unused, and reclaim string storage */
 	existing_slot->pid = 0;
 	pfree(str->data);
@@ -979,7 +976,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_log

Re: Proposal: More structured logging

2021-08-31 Thread Sehrope Sarkuni
On Tue, Aug 24, 2021 at 7:22 PM Michael Paquier  wrote:

> From a code perspective, and while on it, we could split a bit elog.c
> and move the log entries generated for each format into their own
> file.  That would be cleaner for CSV and JSON.  As a whole I don't
> have an objection with moving the JSON format into core.  If one
> proposes a patch, feel free to reuse the code from the module I have
> released.
>

I had a patch that does exactly this from a few years ago. It started off by
copying a bunch out of your json logging module and adding it as a new
"jsonlog" destination.

It needed some cleanup due to bit rot, but it now builds and works atop
master. I'll post it in its own thread.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: Add SQL function for SHA1

2021-01-26 Thread Sehrope Sarkuni
On Tue, Jan 26, 2021 at 8:53 PM Michael Paquier  wrote:

> On Tue, Jan 26, 2021 at 10:38:43AM +0100, Daniel Gustafsson wrote:
> > Agreed, and pgcrypto already allows for using sha1.
> >
> > It seems like any legitimate need for sha1 could be better served by an
> > extension rather than supplying it in-core.
>
> Both of you telling the same thing is enough for me to discard this
> new stuff.  I'd like to refactor the code anyway as that's a nice
> cleanup, and this would have the advantage to make people look at
> cryptohashfuncs.c if introducing a new type.  After sleeping about it,
> I think that I would just make MD5 and SHA1 issue an elog(ERROR) if
> the internal routine is taken in those cases, like in the attached.
>

The refactor patch looks good. It builds and passes make check.

Thanks for the enum explanation too.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: Add SQL function for SHA1

2021-01-25 Thread Sehrope Sarkuni
+1 to adding a SHA1 SQL function. Even if it's deprecated, there's plenty
of historical usage, so I can see it being useful.

Either way, the rest of the refactor can be improved a bit to perform a
single palloc() and remove the memcpy(). Attached is a diff for
cryptohashfuncs.c that does that by writing the final digest directly into
the result. It also removes the digest length arg and determines it in the
switch block. There's only one correct digest length for each type, so
there's no reason to give callers the option to give the wrong one.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
diff --git a/src/backend/utils/adt/cryptohashfuncs.c b/src/backend/utils/adt/cryptohashfuncs.c
index d99485f4c6..8dc695e80c 100644
--- a/src/backend/utils/adt/cryptohashfuncs.c
+++ b/src/backend/utils/adt/cryptohashfuncs.c
@@ -15,6 +15,7 @@
 
 #include "common/cryptohash.h"
 #include "common/md5.h"
+#include "common/sha1.h"
 #include "common/sha2.h"
 #include "utils/builtins.h"
 
@@ -68,6 +69,77 @@ md5_bytea(PG_FUNCTION_ARGS)
 	PG_RETURN_TEXT_P(cstring_to_text(hexsum));
 }
 
+/*
+ * Internal routine to compute a cryptohash with the given bytea input.
+ */
+static inline bytea *
+cryptohash_internal(pg_cryptohash_type type, bytea *input)
+{
+	const uint8 *data;
+	const char  *typestr;
+	int			digest_len;
+	size_t		len;
+	pg_cryptohash_ctx *ctx;
+	bytea	   *result;
+
+	switch (type)
+	{
+		case PG_MD5:
+			typestr = "MD5";
+			digest_len = MD5_DIGEST_LENGTH;
+			break;
+		case PG_SHA1:
+			typestr = "SHA1";
+			digest_len = SHA1_DIGEST_LENGTH;
+			break;
+		case PG_SHA224:
+			typestr = "SHA224";
+			digest_len = PG_SHA224_DIGEST_LENGTH;
+			break;
+		case PG_SHA256:
+			typestr = "SHA256";
+			digest_len = PG_SHA256_DIGEST_LENGTH;
+			break;
+		case PG_SHA384:
+			typestr = "SHA384";
+			digest_len = PG_SHA384_DIGEST_LENGTH;
+			break;
+		case PG_SHA512:
+			typestr = "SHA512";
+			digest_len = PG_SHA512_DIGEST_LENGTH;
+			break;
+		default:
+			elog(ERROR, "unsupported digest type %d", type);
+	}
+
+	result = palloc0(digest_len + VARHDRSZ);
+	len = VARSIZE_ANY_EXHDR(input);
+	data = (unsigned char *) VARDATA_ANY(input);
+
+	ctx = pg_cryptohash_create(type);
+	if (pg_cryptohash_init(ctx) < 0)
+		elog(ERROR, "could not initialize %s context", typestr);
+	if (pg_cryptohash_update(ctx, data, len) < 0)
+		elog(ERROR, "could not update %s context", typestr);
+	if (pg_cryptohash_final(ctx, (unsigned char *) VARDATA(result)) < 0)
+		elog(ERROR, "could not finalize %s context", typestr);
+	pg_cryptohash_free(ctx);
+
+	SET_VARSIZE(result, digest_len + VARHDRSZ);
+
+	return result;
+}
+
+/*
+ * SHA-1
+ */
+Datum
+sha1_bytea(PG_FUNCTION_ARGS)
+{
+	bytea   *result = cryptohash_internal(PG_SHA1, PG_GETARG_BYTEA_PP(0));
+
+	PG_RETURN_BYTEA_P(result);
+}
 
 /*
  * SHA-2 variants
@@ -76,28 +148,7 @@ md5_bytea(PG_FUNCTION_ARGS)
 Datum
 sha224_bytea(PG_FUNCTION_ARGS)
 {
-	bytea	   *in = PG_GETARG_BYTEA_PP(0);
-	const uint8 *data;
-	size_t		len;
-	pg_cryptohash_ctx *ctx;
-	unsigned char buf[PG_SHA224_DIGEST_LENGTH];
-	bytea	   *result;
-
-	len = VARSIZE_ANY_EXHDR(in);
-	data = (unsigned char *) VARDATA_ANY(in);
-
-	ctx = pg_cryptohash_create(PG_SHA224);
-	if (pg_cryptohash_init(ctx) < 0)
-		elog(ERROR, "could not initialize %s context", "SHA224");
-	if (pg_cryptohash_update(ctx, data, len) < 0)
-		elog(ERROR, "could not update %s context", "SHA224");
-	if (pg_cryptohash_final(ctx, buf) < 0)
-		elog(ERROR, "could not finalize %s context", "SHA224");
-	pg_cryptohash_free(ctx);
-
-	result = palloc(sizeof(buf) + VARHDRSZ);
-	SET_VARSIZE(result, sizeof(buf) + VARHDRSZ);
-	memcpy(VARDATA(result), buf, sizeof(buf));
+	bytea   *result = cryptohash_internal(PG_SHA224, PG_GETARG_BYTEA_PP(0));
 
 	PG_RETURN_BYTEA_P(result);
 }
@@ -105,28 +156,7 @@ sha224_bytea(PG_FUNCTION_ARGS)
 Datum
 sha256_bytea(PG_FUNCTION_ARGS)
 {
-	bytea	   *in = PG_GETARG_BYTEA_PP(0);
-	const uint8 *data;
-	size_t		len;
-	pg_cryptohash_ctx *ctx;
-	unsigned char buf[PG_SHA256_DIGEST_LENGTH];
-	bytea	   *result;
-
-	len = VARSIZE_ANY_EXHDR(in);
-	data = (unsigned char *) VARDATA_ANY(in);
-
-	ctx = pg_cryptohash_create(PG_SHA256);
-	if (pg_cryptohash_init(ctx) < 0)
-		elog(ERROR, "could not initialize %s context", "SHA256");
-	if (pg_cryptohash_update(ctx, data, len) < 0)
-		elog(ERROR, "could not update %s context", "SHA256");
-	if (pg_cryptohash_final(ctx, buf) < 0)
-		elog(ERROR, "could not finalize %s context", "SHA256");
-	pg_cryptohash_free(ctx);
-
-	result = palloc(sizeof(buf) + VARHDRSZ);
-	SET_VARSIZE(result, sizeof(buf) + VARHDRSZ);
-	memcpy(VARDATA(result), buf, sizeof(buf));
+	bytea   *result = cryptohash_internal(PG_SHA256, PG_GET

Re: Moving other hex functions to /common

2021-01-13 Thread Sehrope Sarkuni
The length functions in src/common/hex.c should cast srclen to uint64 prior
to the shift; otherwise the shift is evaluated at the width of size_t and can
overflow on platforms where size_t is 32 bits, before being widened to the
uint64 return type. The current hex_enc_len(...) in encode.c performs such a
cast.

diff --git a/src/common/hex.c b/src/common/hex.c
index 0123c69697..e87aa1fd7f 100644
--- a/src/common/hex.c
+++ b/src/common/hex.c
@@ -178,7 +178,7 @@ pg_hex_decode(const char *src, size_t srclen, char
*dst, size_t dstlen)
 uint64
 pg_hex_enc_len(size_t srclen)
 {
-   return (srclen << 1);
+   return (uint64) srclen << 1;
 }

 /*
@@ -192,5 +192,5 @@ pg_hex_enc_len(size_t srclen)
 uint64
 pg_hex_dec_len(size_t srclen)
 {
-   return (srclen >> 1);
+   return (uint64) srclen >> 1;
 }

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: 2020-02-13 Press Release Draft

2020-02-10 Thread Sehrope Sarkuni
Typo in 9.4 retirement message:

s/is it time to retire/it is time to retire/

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: Internal key management system

2020-02-05 Thread Sehrope Sarkuni
On Sat, Feb 1, 2020 at 7:02 PM Masahiko Sawada <
masahiko.saw...@2ndquadrant.com> wrote:
> On Sun, 2 Feb 2020 at 00:37, Sehrope Sarkuni  wrote:
> >
> > On Fri, Jan 31, 2020 at 1:21 AM Masahiko Sawada
> >  wrote:
> > > On Thu, 30 Jan 2020 at 20:36, Sehrope Sarkuni 
wrote:
> > > > That
> > > > would allow the internal usage to have a fixed output length of
> > > > LEN(IV) + LEN(HMAC) + LEN(DATA) = 16 + 32 + 64 = 112 bytes.
> > >
> > > Probably you meant LEN(DATA) is 32? DATA will be an encryption key for
> > > AES256 (master key) internally generated.
> >
> > No it should be 64-bytes. That way we can have separate 32-byte
> > encryption key (for AES256) and 32-byte MAC key (for HMAC-SHA256).
> >
> > While it's common to reuse the same 32-byte key for both AES256 and an
> > HMAC-SHA256 and there aren't any known issues with doing so, when
> > designing something from scratch it's more secure to use entirely
> > separate keys.
>
> The HMAC key you mentioned above is not the same as the HMAC key
> derived from the user provided passphrase, right? That is, individual
> key needs to have its IV and HMAC key. Given that the HMAC key used
> for HMAC(IV || ENCRYPT(KEY, IV, DATA)) is the latter key (derived from
> passphrase), what will be the former key used for?

It's not derived from the passphrase, it's unlocked by the passphrase
(along with the master encryption key). The server will have 64-bytes of
random data, saved encrypted in pg_control, which can be treated as two
separate 32-byte keys, let's call them master_encryption_key and
master_mac_key. The 64-bytes is unlocked by decrypting it with the user
passphrase at startup (which itself would be split into a pair of
encryption and MAC keys to do the unlocking).

The wrap and unwrap operations would use both keys:

wrap(plain_text, encryption_key, mac_key) {
// Generate random IV:
iv = pg_strong_random(16);
// Encrypt:
cipher_text = encrypt_aes256_cbc(encryption_key, iv, plain_text);
// Compute MAC on all inputs:
mac = hmac_sha256(mac_key, encryption_key || iv || cipher_text);
// Concat user facing pieces together
wrapped = mac || iv || cipher_text;
return wrapped;
}

unwrap(wrapped, encryption_key, mac_key) {
// Split wrapped into its pieces:
actual_mac = wrapped.slice(0, 32);
iv = wrapped.slice(0 + 32, 16);
cipher_text = wrapped.slice(0 + 32 + 16);
// Compute MAC on all inputs:
expected_mac = hmac_sha256(mac_key, encryption_key || iv ||
cipher_text);
// Compare MAC vs value in wrapped:
if (expected_mac != actual_mac) { return Error("MAC does not match"); }
// MAC matches so decrypt:
plain_text = decrypt_aes256_cbc(encryption_key, iv, cipher_text);
return plain_text;
}

Every input to the encryption operation, including the encryption key, must
be included in the HMAC calculation. If you use the same key for both
encryption and MAC, that's not required as the key is already part of the MAC
process. Using separate keys requires explicitly adding the encryption key to
the MAC input to ensure that it is the correct key prior to decryption in the
unwrap operation. Any additional parts of the wrapped output (ex: a "version"
byte for the algos or padding choices) should also be included.

The wrap / unwrap above would be used with the encryption and mac keys
derived from the user passphrase to unlock the master_encryption_key and
master_mac_key from pg_control. Then those would be used by the higher
level functions:

pg_kmgr_wrap(plain_text) {
    return wrap(plain_text, master_encryption_key, master_mac_key);
}

pg_kmgr_unwrap(wrapped) {
return unwrap(wrapped, master_encryption_key, master_mac_key);
}

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2020-02-01 Thread Sehrope Sarkuni
On Fri, Jan 31, 2020 at 1:21 AM Masahiko Sawada
 wrote:
> On Thu, 30 Jan 2020 at 20:36, Sehrope Sarkuni  wrote:
> > That
> > would allow the internal usage to have a fixed output length of
> > LEN(IV) + LEN(HMAC) + LEN(DATA) = 16 + 32 + 64 = 112 bytes.
>
> Probably you meant LEN(DATA) is 32? DATA will be an encryption key for
> AES256 (master key) internally generated.

No it should be 64-bytes. That way we can have separate 32-byte
encryption key (for AES256) and 32-byte MAC key (for HMAC-SHA256).

While it's common to reuse the same 32-byte key for both AES256 and an
HMAC-SHA256 and there aren't any known issues with doing so, when
designing something from scratch it's more secure to use entirely
separate keys.

> > For the user facing piece, padding would enabled to support arbitrary
> > input data lengths. That would make the output length grow by up to
> > 16-bytes (rounding the data length up to the AES block size) plus one
> > more byte if a version field is added.
>
> I think the length of padding also needs to be added to the output.
> Anyway, in the first version the same methods of wrapping/unwrapping
> key are used for both internal use and user facing function. And user
> input key needs to be a multiple of 16 bytes value.

A separate length field does not need to be added as the padding-enabled
output already encodes it at the end[1]: with PKCS#7 each pad byte holds the
pad length (e.g. a 13-byte final block gets three 0x03 bytes, and an exact
multiple of the block size gets a full block of 0x10 bytes), so the decryptor
always recovers the original length. This would be handled automatically by
the OpenSSL encryption / decryption operations (if it's enabled):

[1]: https://en.wikipedia.org/wiki/Padding_(cryptography)#PKCS#5_and_PKCS#7

> BTW I think this topic is better to be discussed on a separate thread
> as the scope no longer includes TDE. I'll start a new thread for
> introducing internal KMS.

Good idea.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2020-01-30 Thread Sehrope Sarkuni
On Wed, Jan 29, 2020 at 3:43 AM Masahiko Sawada
 wrote:
> Thank you for the suggestion. I like your suggestion. We can do an
> integrity check of the user input wrapped key by using HMAC when
> unwrapping. Regarding the output format you meant to use aes-256
> rather than aes-256-key-wrap? I think that DATA in the output is the
> user input key so it still must be multiples of 16-bytes if we use
> aes-256-key-wrap.

Yes I'm suggesting not using the key wrap functions and instead using
the regular EVP_aes_256_cbc with a random IV per invocation. For
internal usage (e.g. the encrypted key) it does not need padding as we
know the input value would always be a multiple of 16-bytes. That
would allow the internal usage to have a fixed output length of
LEN(IV) + LEN(HMAC) + LEN(DATA) = 16 + 32 + 64 = 112 bytes.
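
A sketch of that internal fixed-size case (again only an illustration of
the OpenSSL calls, not patch code; the function name is made up) with a
fresh random IV per call and padding disabled:

#include <openssl/evp.h>
#include <openssl/rand.h>

/* Wrap a fixed 64-byte payload (32-byte enc key + 32-byte MAC key). */
static int
encrypt_fixed64(const unsigned char key[32],
                unsigned char iv_out[16],
                const unsigned char data[64],
                unsigned char ct_out[64])
{
    EVP_CIPHER_CTX *ctx;
    int len = 0;

    /* New random IV for every invocation */
    if (RAND_bytes(iv_out, 16) != 1)
        return -1;

    ctx = EVP_CIPHER_CTX_new();
    EVP_EncryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, key, iv_out);
    /* Input is always a multiple of the AES block size, so no padding */
    EVP_CIPHER_CTX_set_padding(ctx, 0);
    EVP_EncryptUpdate(ctx, ct_out, &len, data, 64);
    EVP_EncryptFinal_ex(ctx, ct_out + len, &len);
    EVP_CIPHER_CTX_free(ctx);
    return 0;
}

Adding the 32-byte HMAC alongside the 16-byte IV and 64-byte cipher text
is what gives the fixed 112-byte output above.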

For the user facing piece, padding would be enabled to support arbitrary
input data lengths. That would make the output length grow by up to
16-bytes (rounding the data length up to the AES block size) plus one
more byte if a version field is added.

> BTW regarding the implementation of cipher function using opensssl in
> the src/common I'm concerned whether we should integrate it with the
> openssl.c in pgcrypto. Since pgcrypto with openssl currently supports
> aes, des and bf etc the cipher function code in this patch also has
> similar functionality. Similarly when we introduced SCRAM we moved
> sha2 functions from pgcrypto to src/common. I thought we move all
> cipher functions in pgcrypto to src/common but it might be overkill
> because the internal KMS will use only aes with only 256 key length as
> of now.

I'd keep the patch smaller and the functions internal to the KMS for
now. Maybe address it again after the patch is complete as it'll be
easier to see overlaps that could be combined.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2020-01-25 Thread Sehrope Sarkuni
but it's only 64-bits (vs. say
256-bits for a full HMAC-SHA-256).

Rather than use EVP_aes_256_wrap() with its defaults, we can generate
a random IV and have the output be "IV || ENCRYPT(KEY, IV, DATA) ||
HMAC(IV || ENCRYPT(KEY, IV, DATA))". For a fixed length internal input
(ex: the KEK encrypted key stored in pg_control) there's no need for
padding as we're dealing with multiples of 16-bytes (ex: KEK encrypted
enc-key / mac-key would be 64-bytes).

It'd also be useful if the user level wrap/unwrap API allowed for
arbitrary sized inputs (not just multiples of 16 bytes). Having the
output be in a standard format (i.e. matching OpenSSL's
EVP_aes_256_wrap API) is nice, but as it's meant to be an opaque
interface I think it's fine if the output is not usable outside the
database. I don't see anyone using the wrapped data directly as it's
random bytes without the key. The primary contract for the interface:
"data == unwrap(wrap(data))". This would require enabling padding
which would round up the size of the output to the next 16-bytes.
Adding a prefix byte for a "version" would be nice too as it could be
used to infer the specific cipher/mac combo (Ex: v1 would be
AES256/HMAC-SHA256). I don't think the added size is an issue as
again, the output is opaque. Similar things can also be accomplished
by combining the 16-byte only version with pgcrypto but like this it'd
be usable out of the box without additional extensions.

[1]: https://www.openssl.org/docs/man1.1.1/man3/EVP_aes_256_wrap.html
[2]: https://tools.ietf.org/html/rfc5649

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
diff --git a/configure b/configure
index 702adba839..2daa5bf0c2 100755
--- a/configure
+++ b/configure
@@ -12113,7 +12113,7 @@ done
   # defines OPENSSL_VERSION_NUMBER to claim version 2.0.0, even though it
   # doesn't have these OpenSSL 1.1.0 functions. So check for individual
   # functions.
-  for ac_func in OPENSSL_init_ssl BIO_get_data BIO_meth_new ASN1_STRING_get0_data
+  for ac_func in OPENSSL_init_ssl OPENSSL_init_crypto BIO_get_data BIO_meth_new ASN1_STRING_get0_data
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 8165f70039..79329d9f15 100644
--- a/configure.in
+++ b/configure.in
@@ -1194,7 +1194,7 @@ if test "$with_openssl" = yes ; then
   # defines OPENSSL_VERSION_NUMBER to claim version 2.0.0, even though it
   # doesn't have these OpenSSL 1.1.0 functions. So check for individual
   # functions.
-  AC_CHECK_FUNCS([OPENSSL_init_ssl BIO_get_data BIO_meth_new ASN1_STRING_get0_data])
+  AC_CHECK_FUNCS([OPENSSL_init_ssl OPENSSL_init_crypto BIO_get_data BIO_meth_new ASN1_STRING_get0_data])
   # OpenSSL versions before 1.1.0 required setting callback functions, for
   # thread-safety. In 1.1.0, it's no longer required, and CRYPTO_lock()
   # function was removed.
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 9706a95848..4ace302038 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -21,7 +21,7 @@ SUBDIRS = access bootstrap catalog parser commands executor foreign lib libpq \
 	main nodes optimizer partitioning port postmaster \
 	regex replication rewrite \
 	statistics storage tcop tsearch utils $(top_builddir)/src/timezone \
-	jit
+	jit crypto
 
 include $(srcdir)/common.mk
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7f4f784c0e..5293137225 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -41,6 +41,7 @@
 #include "catalog/pg_database.h"
 #include "commands/tablespace.h"
 #include "common/controldata_utils.h"
+#include "crypto/kmgr.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
@@ -77,6 +78,7 @@
 #include "utils/timestamp.h"
 
 extern uint32 bootstrap_data_checksum_version;
+extern uint32 bootstrap_data_encryption_cipher;
 
 /* Unsupported old recovery command file names (relative to $PGDATA) */
 #define RECOVERY_COMMAND_FILE	"recovery.conf"
@@ -4779,6 +4781,10 @@ ReadControlFile(void)
 	/* Make the initdb settings visible as GUC variables, too */
 	SetConfigOption("data_checksums", DataChecksumsEnabled() ? "yes" : "no",
 	PGC_INTERNAL, PGC_S_OVERRIDE);
+
+	SetConfigOption("data_encryption_cipher",
+	kmgr_cipher_string(GetDataEncryptionCipher()),
+	PGC_INTERNAL, PGC_S_OVERRIDE);
 }
 
 /*
@@ -4811,6 +4817,13 @@ GetMockAuthenticationNonce(void)
 	return ControlFile->mock_authentication_nonce;
 }
 
+WrappedEncKeyWithHmac *
+GetMasterEncryptionKey(void)
+{
+	Assert(ControlFile != NULL);
+	return &(ControlFile->master_dek);
+}
+
 /*
  * Are checksums enabled for data pages?
  */
@@ -4821,6 +4834,13 @@ DataChecksumsEn

Re: ssl passphrase callback

2019-11-14 Thread Sehrope Sarkuni
On Wed, Nov 13, 2019 at 3:23 PM Tomas Vondra
 wrote:
> I think it would be beneficial to explain why shared object is more
> secure than an OS command. Perhaps it's common knowledge, but it's not
> quite obvious to me.

External command args can be viewed by other OS users (not just the
postgres user). For non-sensitive arguments (ex: WAL filename?) that's
not an issue, but if you plan on passing a potentially secret value
from the backend to the external OS command, that value would be
exposed:

Terminal 1 (run a command as some other user):
$ sudo -u nobody sleep 5

Terminal 2 (view command args as a different non-super user):
$ ps -u nobody -o args
COMMAND
sleep 5

A shared library would not have this problem as the backend directly
executes the library in the same process.

Has the idea of using environment variables (rather than command line
args) for external commands been brought up before? I couldn't find
anything in the mailing list archives.

Environment variables have the advantage of only being readable by the
process owner and super user. They also naturally have a "name" and do
not have escaping or quoting issues.

For example, archive_command %p could be exposed as
PG_ARG_ARCHIVE_PATH and %f could be exposed as
PG_ARG_ARCHIVE_FILENAME. Then you could have a script use them via:

#!/usr/bin/env bash
set -euo pipefail
main () {
    local archive_dir="/path/to/archive_dir/"
    local archive_file="${archive_dir}${PG_ARG_ARCHIVE_FILENAME}"
    test ! -f "${archive_file}" && \
        cp -- "${PG_ARG_ARCHIVE_PATH}" "${archive_file}"
}
main "$@"

It's not particularly useful for that basic archive case but if
there's something like PG_ARG_SUPER_SECRET then the executed command
could receive that value without it being exposed. That'd be useful
for something like a callout to an external KMS (key management
system).

Nothing stops something like this from being used in tandem with
string substitution to create the full commands. That'd give backwards
compatibility too. The main limitation compared to a shared library is
that you'd still have to explicitly pick and name the exposed argument
environment variables (i.e. like picking the set of %? substitutions).
If there's a generic shared-library-for-external-commands approach
then there could be a built-in library that provides this type of
"expose as env vars" functionality.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/




Re: 2019-11-14 Press Release Draft

2019-11-14 Thread Sehrope Sarkuni
> * Several fixes for logical replication, including a failure when the 
> publisher
> & subscriber had different REPLICA IDENTITY columns set.

"&" should probably be "and" as I don't see it used like that in any
other release notes.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/




Re: PostgreSQL 12 RC1 Press Release Draft

2019-09-25 Thread Sehrope Sarkuni
The "Upgrading to PostgreSQL 12 RC 1" references v11 rather than v12:

"To upgrade to PostgreSQL 11 RC 1 from Beta 4 or an earlier version of
PostgreSQL 11, ..."

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-17 Thread Sehrope Sarkuni
On Sat, Aug 17, 2019 at 12:43 PM Ibrar Ahmed  wrote:

> +1 for voice call, bruce we usually have a weekly TDE call.
>

Please add me to the call as well. Thanks!

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-08 Thread Sehrope Sarkuni
On Thu, Aug 8, 2019 at 6:31 PM Stephen Frost  wrote:

> Strictly speaking, that isn't actually crash recovery, it's physical
> replication / HA, and while those are certainly nice to have it's no
> guarantee that they're required or that you'd want to have the same keys
> for them- conceptually, at least, you could have WAL with one key that
> both sides know and then different keys for the actual data files, if we
> go with the approach where the WAL is encrypted with one key and then
> otherwise is plaintext.
>

I like the idea of separating the WAL key from the rest of the data files.
It'd all be unlocked by the MDEK and you'd still need derived keys per
WAL-file, but disconnecting all that from the data files solves a lot of
the problems with promoted replicas.

This would complicate cloning a replica as using a different MDEK would
involve decrypting / encrypting everything rather than just copying the
files. Even if that's not baked in a first version, the separation allows
for eventually supporting that.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-08 Thread Sehrope Sarkuni
On Thu, Aug 8, 2019 at 2:16 PM Bruce Momjian  wrote:

> On Wed, Aug  7, 2019 at 08:56:18AM -0400, Sehrope Sarkuni wrote:
> > Simplest approach for derived keys would be to use immutable attributes
> of the
> > WAL files as an input to the key derivation. Something like HKDF(MDEK,
> "WAL:" |
>
> So, I am thinking we should use "WAL:" for WAL and "REL:" for heap/index
> files.
>

Sounds good. Any unique convention is fine. Main thing to keep in mind is
that they're directly tied to the master key so it's not possible to rotate
them without changing the master key.

This is in contrast to saving a WDEK key to a file (similar to how the MDEK
key would be saved) and unlocking it with the MDEK. That has more moving
parts but would allow that key to be independent of the MDEK. In a later
message Stephen refers to an example of a replica receiving encrypted WAL
and applying it with a different MDEK for the page buffers. That's doable
with an independent WDEK.


> > | timeline_id || wal_segment_num) should be fine for this as it is:
>
> I considered using the timeline in the nonce, but then remembered that
> in timeline switch, we _copy_ the part of the old WAL up to the timeline
> switch to the new timeline;  see:
>
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/transam/xlog.c;h=f55352385732c6b0124eff5265462f3883fe7435;hb=HEAD#l5502
>
>* Initialize the starting WAL segment for the new timeline. If the
> switch
>* happens in the middle of a segment, copy data from the last WAL
> segment
>* of the old timeline up to the switch point, to the starting WAL
> segment
>* on the new timeline.
>
> We would need to decrypt/encrypt to do the copy, and just wasn't sure of
> the value of the timeline in the nonce.  One value to it is that if
> there some WAL that generated after the timeline switch in the old
> primary that isn't transfered, there would be potentially new data
> encrypted with the same key/nonce in the new primary, but if that WAL is
> not used, odds are it is gone/destroyed/inaccessible, or it would have
> been used during the switchover, so it didn't seem worth worrying about.
>
> One _big_ reason to add the timeline is if you had a standby that you
> recovered and rolled forward only to a specific transaction, then
> continued running it as a new primary.  In that case, you would have
> different WAL encrypted with the same key/nonce, but that sounds like
> the same as promoting two standbys, and we should just document not to
> do it.
>
> Maybe we need to consider this further.
>

Good points. Yes, anything short of generating a new key at promotion time
will have these issues. If we're not going to do that, no point in adding
the timeline id if it does not change anything. I had thought only the
combo was unique but sounds like the segment number is unique on its own.
One thing I like about a unique per-file key is that it simplifies the IV
generation (i.e. can start at zero).

What about discarding the rest of the WAL file at promotion and skipping to
a new file? A random per-file key in the first page header would
ensure that going forward all WAL data is encrypted differently. Combine
that with independent WAL and MDEK keys and everything would be different
between two replicas promoted from the same point.


> > A unique WDEK per WAL file that is derived from the segment number would
> not
> > have that problem. A unique key per-file means the IVs can all start at
> zero
> > and the each file can be treated as one encrypted stream. Any encryption/
> > decryption code would only need to touch the write/read callsites.
>
> So, I am now wondering when we should be using a non-zero nonce to
> start, and when we should be using derived keys.   Should we add the
> page-number to the derived key for heap/index files too and just use the
> LSN for nonce, or add the LSN to the derived key too?
>

The main cost of using multiple keys is that you need to derive or unlock
them for each usage.

A per-type, per-relation, or per-file derived key with the same
non-repeating guarantees for the IV (ex: LSN + Page Number) is as secure
but allows for caching all needed derived keys in memory (it's one per open
file descriptor).

Having page-level derived keys incorporating the LSN + Page Number and
starting the per-page IV at zero works, but you'd have to perform an HKDF
for each page read or write. A cache of those derived keys would be much
larger (32-bytes per page) so presumably you're not going to have them all
cached or maybe not bother with any caching.


> > Even without a per-page MAC, a MAC at some level for WAL has its own
> benefits
> > such as perfect corruption detection. It could be per-record,
> per-N-records,
&g

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-07 Thread Sehrope Sarkuni
On Wed, Aug 7, 2019 at 1:39 PM Bruce Momjian  wrote:

> On Wed, Aug  7, 2019 at 11:41:51AM -0400, Sehrope Sarkuni wrote:
> > On Wed, Aug 7, 2019 at 7:19 AM Bruce Momjian  wrote:
> >
> > On Wed, Aug  7, 2019 at 05:13:31PM +0900, Masahiko Sawada wrote:
> > > I understood. IIUC in your approach postgres processes encrypt WAL
> > > records when inserting to the WAL buffer. So WAL data is encrypted
> > > even on the WAL buffer.
> >
> >
> > I was originally thinking of not encrypting the shared WAL buffers but
> that may
> > have issues. If the buffers are already encrypted and contiguous in
> shared
> > memory, it's possible to write out many via a single pg_pwrite(...) call
> as is
> > currently done in XLogWrite(...).
>
> The shared buffers will not be encrypted --- they are encrypted only
> when being written to storage.  We felt encrypting shared buffers will
> be too much overhead, for little gain.  I don't know if we will encrypt
> while writing to the WAL buffers or while writing the WAL buffers to
> the file system.
>

My mistake on the wording. By "shared WAL buffers" I meant the shared
memory used for WAL buffers, XLogCtl->pages. Not the shared buffers for
pages.


> > If they're not encrypted you'd need to do more work in that critical
> section.
> > That'd involve allocating a commensurate amount of memory to hold the
> encrypted
> > pages and then encrypting them all prior to the single pg_pwrite(...)
> call.
> > Reusing one buffer is possible but it would require encrypting and
> writing the
> > pages one by one. Both of those seem like a bad idea.
>
> Well, right now the 8k pages is part of the WAL stream, so I don't know
> it would be any more overhead than other WAL writes.


The total work is the same, but when it happens, how much memory is used,
and how many syscalls are needed could all change.

Right now the XLogWrite(...) code can write many WAL pages at once via a
single call to pg_pwrite(...):
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/transam/xlog.c;h=f55352385732c6b0124eff5265462f3883fe7435;hb=HEAD#l2491

If the blocks are not encrypted then you either need to allocate and
encrypt everything (could be up to wal_buffers max size) to do it as one
write, or encrypt chunks of WAL and do multiple writes. I'm not sure how
big an issue this would be in practice as it'd be workload specific.


> I am hoping we can
> generate the encryption bit stream in chunks earlier so we can just to
> the XOR was we are writing the data to the WAL buffers.
>

For pure CTR that sounds doable as it'd be the same as doing an XOR with
encrypted zero. Anything with a built-in MAC like GCM would not work though
(I'm not proposing we use that, just keeping it in mind).

You'd also increase your memory requirements (one allocation for the
encrypted stream and one for the encrypted data right?).
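
A sketch of generating the bit stream ahead of time (mine, not from any
patch): in CTR mode, encrypting a zero-filled buffer yields exactly the
key stream, which can then be XORed onto the WAL bytes as they are copied
into the buffers.

#include <stdlib.h>
#include <openssl/evp.h>

/* Generate stream_len bytes of AES-256-CTR key stream for key + iv. */
static void
make_keystream(const unsigned char key[32], const unsigned char iv[16],
               unsigned char *stream, int stream_len)
{
    unsigned char *zeroes = calloc(stream_len, 1);
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int len = 0;

    /* ENCRYPT(zeroes) == key stream in CTR mode */
    EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv);
    EVP_EncryptUpdate(ctx, stream, &len, zeroes, stream_len);
    EVP_CIPHER_CTX_free(ctx);
    free(zeroes);
}

/* Later, while copying into the WAL buffers: dst[i] = src[i] ^ stream[i] */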


> > Better to pay the encryption cost at the time of WAL record creation and
> keep
> > the writing process as fast and simple as possible.
>
> Yes, I don't think we know at the time of WAL record creation what
> _offset_ the records will have when then are written to WAL, so I am
> thinking we need to do it later, and as I said, I am hoping we can
> generate the encryption bit stream earlier.
>
> > > It works but I think the implementation might be complex; For
> example
> > > using openssl, we would use EVP functions to encrypt data by
> > > AES-256-CTR. We would need to make IV and pass it to them and these
> > > functions however don't manage the counter value of nonce as long
> as I
> > > didn't miss. That is, we need to calculate the correct counter
> value
> > > for each encryption and pass it to EVP functions. Suppose we
> encrypt
> > > 20 bytes of WAL. The first 16 bytes is encrypted with nonce of
> > > (segment_number, 0) and the next 4 bytes is encrypted with nonce of
> > > (segment_number, 1). After that suppose we encrypt 12 bytes of
> WAL. We
> > > cannot use nonce of (segment_number, 2) but should use nonce of
> > > (segment_number , 1). Therefore we would need 4 bytes padding and
> to
> > > encrypt it and then to throw that 4 bytes away .
> >
> > Since we want to have per-byte control over encryption, for both
> > heap/index pages (skip LSN and CRC), and WAL (encrypt to the last
> byte),
> > I assumed we would need to generate a bit stream of a specified size
> and
> > do the XOR ourselves against the data.  I assume ssh does this, so we
> > would have to study the method.
> >
> >
> > The lower level non-EVP

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-07 Thread Sehrope Sarkuni
On Wed, Aug 7, 2019 at 7:19 AM Bruce Momjian  wrote:

> On Wed, Aug  7, 2019 at 05:13:31PM +0900, Masahiko Sawada wrote:
> > I understood. IIUC in your approach postgres processes encrypt WAL
> > records when inserting to the WAL buffer. So WAL data is encrypted
> > even on the WAL buffer.
>

I was originally thinking of not encrypting the shared WAL buffers but that
may have issues. If the buffers are already encrypted and contiguous in
shared memory, it's possible to write out many via a single pg_pwrite(...)
call as is currently done in XLogWrite(...).

If they're not encrypted you'd need to do more work in that critical
section. That'd involve allocating a commensurate amount of memory to hold
the encrypted pages and then encrypting them all prior to the single
pg_pwrite(...) call. Reusing one buffer is possible but it would require
encrypting and writing the pages one by one. Both of those seem like a bad
idea.

Better to pay the encryption cost at the time of WAL record creation and
keep the writing process as fast and simple as possible.


> > It works but I think the implementation might be complex; For example
> > using openssl, we would use EVP functions to encrypt data by
> > AES-256-CTR. We would need to make IV and pass it to them and these
> > functions however don't manage the counter value of nonce as long as I
> > didn't miss. That is, we need to calculate the correct counter value
> > for each encryption and pass it to EVP functions. Suppose we encrypt
> > 20 bytes of WAL. The first 16 bytes is encrypted with nonce of
> > (segment_number, 0) and the next 4 bytes is encrypted with nonce of
> > (segment_number, 1). After that suppose we encrypt 12 bytes of WAL. We
> > cannot use nonce of (segment_number, 2) but should use nonce of
> > (segment_number , 1). Therefore we would need 4 bytes padding and to
> > encrypt it and then to throw that 4 bytes away .
>
> Since we want to have per-byte control over encryption, for both
> heap/index pages (skip LSN and CRC), and WAL (encrypt to the last byte),
> I assumed we would need to generate a bit stream of a specified size and
> do the XOR ourselves against the data.  I assume ssh does this, so we
> would have to study the method.
>

The lower level non-EVP OpenSSL functions allow specifying the offset
within the 16-byte AES block from which the encrypt/decrypt should proceed.
It's the "num" parameter of their encrypt/decrypt functions. For a
continuous encrypted stream such as a WAL file, a "pread(...)" of a
possibly non-16-byte aligned section would involve determining the 16-byte
counter (byte_offset / 16) and the intra-block offset (byte_offset % 16).
I'm not sure how one handles initializing the internal encrypted counter
and that might be one more step that would need be done. But it's
definitely possible to read / write less than a block via those APIs (not
the EVP ones).

I don't think the EVP functions have parameters for the intra-block offset
but you can mimic it by initializing the IV/block counter and then skipping
over the intra-block offset by either reading or writing a dummy partial
block. The EVP read and write functions both deal with individual bytes so
once you've seeked to your desired offset you can read or write the real
individual bytes.
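
Roughly, the EVP version of that seek could look like the sketch below
(my illustration only; it assumes a 16-byte IV laid out as an 8-byte
per-file prefix followed by an 8-byte big-endian block counter, which is
just one possible layout):

#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>

/* Position an AES-256-CTR context at an arbitrary byte offset. */
static void
ctr_seek(EVP_CIPHER_CTX *ctx, const unsigned char key[32],
         const unsigned char iv_prefix[8], uint64_t byte_offset)
{
    unsigned char iv[16];
    unsigned char dummy_in[16] = {0};
    unsigned char dummy_out[16];
    uint64_t block = byte_offset / 16;
    int intra = (int) (byte_offset % 16);
    int len = 0;
    int i;

    /* 8-byte fixed prefix followed by the big-endian block counter */
    memcpy(iv, iv_prefix, 8);
    for (i = 0; i < 8; i++)
        iv[8 + i] = (unsigned char) (block >> (56 - 8 * i));

    EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv);

    /* Discard the partial block so the stream lines up with the offset */
    if (intra > 0)
        EVP_EncryptUpdate(ctx, dummy_out, &len, dummy_in, intra);
}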

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-07 Thread Sehrope Sarkuni
On Mon, Aug 5, 2019 at 9:02 PM Bruce Momjian  wrote:

> On Wed, Jul 31, 2019 at 09:25:01AM -0400, Sehrope Sarkuni wrote:
> > Even if we do not include a separate per-relation salt or things like
> > relfilenode when generating a derived key, we can still include other
> types of
> > immutable attributes. For example the fork type could be included to
> eventually
> > allow multiple forks for the same relation to be encrypted with the same
> IV =
> > LSN + Page Number as the derived key per-fork would be distinct.
>
> Yes, the fork number could be useful in this case.  I was thinking we
> would just leave the extra bits as zeros and we can then set it to '1'
> or something else for a different fork.
>

Key derivation has more flexibility as you're not limited by the number of
unused bits in the IV.


> > WAL encryption should not use the same key as page encryption so there's
> no
> > need to design the IV to try to avoid matching the page IVs. Even a basic
> > derivation with a single fixed WDEK = HKDF(MDEK, "WAL") and TDEK =
> HKDF(MDEK,
> > "PAGE") would ensure separate keys. That's the the literal string "WAL"
> or
> > "PAGE" being added as a salt to generate the respective keys, all that
> matters
> > is they're different.
>
> I was thinking the WAL would use the same key since the nonce is unique
> between the two.  What value is there in using a different key?
>

Never having to worry about overlap in Key + IV usage is main advantage.
While it's possible to structure IVs to avoid that from happening, it's
much easier to completely avoid that situation by ensuring different parts
of an application are using separate derived keys.


> > Ideally WAL encryption would generating new derived keys as part of the
> WAL
> > stream. The WAL stream is not fixed so you have the luxury of being able
> to add
> > a "Use new random salt XZY going forward" records. Forcing generation of
> a new
> > salt/key upon promotion of a replica would ensure that at least the WAL
> is
> > unique going forward. Could also generate a new upon server startup,
> after
>
> Ah, yes, good point, and using a derived key would make that easier.
> The tricky part is what to use to create the new derived key, unless we
> generate a random number and store that somewhere in the data directory,
> but that might lead to fragility, so I am worried.


Simplest approach for derived keys would be to use immutable attributes of
the WAL files as an input to the key derivation. Something like HKDF(MDEK,
"WAL:" || timeline_id || wal_segment_num) should be fine for this as it is:

* Unique per WAL file
* Known prior to writing to a given WAL file
* Known prior to reading a given WAL file
* Does not require any additional persistence
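
As a concrete sketch (illustration only; it assumes OpenSSL 1.1.0+'s
EVP_PKEY HKDF interface, and the exact encoding of the info string is
just one possible choice):

#include <stdint.h>
#include <stdio.h>
#include <openssl/evp.h>
#include <openssl/kdf.h>

/* Derive a 32-byte per-WAL-segment key from the MDEK. */
static int
derive_wal_key(const unsigned char *mdek, int mdek_len,
               uint32_t timeline_id, uint64_t segment_num,
               unsigned char out_key[32])
{
    char info[64];
    int info_len;
    size_t out_len = 32;
    EVP_PKEY_CTX *pctx = EVP_PKEY_CTX_new_id(EVP_PKEY_HKDF, NULL);

    /* info = "WAL:" || timeline_id || wal_segment_num */
    info_len = snprintf(info, sizeof(info), "WAL:%u:%llu",
                        (unsigned) timeline_id,
                        (unsigned long long) segment_num);

    if (EVP_PKEY_derive_init(pctx) <= 0 ||
        EVP_PKEY_CTX_set_hkdf_md(pctx, EVP_sha256()) <= 0 ||
        EVP_PKEY_CTX_set1_hkdf_key(pctx, mdek, mdek_len) <= 0 ||
        EVP_PKEY_CTX_add1_hkdf_info(pctx, (unsigned char *) info,
                                    info_len) <= 0 ||
        EVP_PKEY_derive(pctx, out_key, &out_len) <= 0)
    {
        EVP_PKEY_CTX_free(pctx);
        return -1;
    }
    EVP_PKEY_CTX_free(pctx);
    return 0;
}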

> We have pg_rewind,
> which allows to make the WAL go backwards.  What is the value in doing
> this?
>

Good point re: pg_rewind. Having key rotation records in the stream would
complicate that as you'd have to jump back / forward to figure out which
key to use. It's doable but much more complicated.

A unique WDEK per WAL file that is derived from the segment number would
not have that problem. A unique key per-file means the IVs can all start at
zero and each file can be treated as one encrypted stream. Any
encryption/decryption code would only need to touch the write/read
callsites.


> > every N bytes, or a new one for each new WAL file. There's much more
> > flexibility compared to page encryption.
> >
> > As WAL is a single continuous stream, we can start the IV for each
> derived WAL
> > key from zero. There's no need to complicate it further as Key + IV will
> never
> > be reused.
>
> Uh, you want a new random key for each WAL file?  I was going to use the
> WAL segment number as the nonce, which is always increasing, and easily
> determined.  The file is 16MB.
>

Ideally yes as it would allow for multiple replicas promoted off the same
primary to immediately diverge as each would have its own keys. I don't
consider it a requirement but if it's possible without significant added
complexity I say that's a win.

I'm still reading up on the file and record format to understand how
complex that would be. Though given your point re: pg_rewind and the lack
of handling for page encryption divergence when promoting multiple
replicas, I doubt the complexity will be worth it.


> > If WAL is always written as full pages we need to ensure that the empty
> parts
> > of the page are actual zeros and not "encrypted zeroes". Otherwise an
> XOR of
> > the empty section of the first write of a page against a subsequent one
> would
> > give you the plain text.
>
> Right, I think w

Re: Fix typos

2019-08-02 Thread Sehrope Sarkuni
On Fri, Aug 2, 2019 at 12:11 AM Michael Paquier  wrote:

> On Thu, Aug 01, 2019 at 11:01:59PM -0400, Alvaro Herrera wrote:
> > I think slight variations don't really detract from the value of the
> > product, and consider the odd variation a reminder of the diversity of
> > the project.  I don't suggest that we purposefully introduce spelling
> > variations, or that we refrain from fixing ones that appear in code
> > we're changing, but I don't see the point in changing a line for the
> > sole reason of standardising the spelling of a word.
>
> Agreed.  This always reminds me of ANALYZE vs. ANALYSE where we don't
> actually document the latter :)
>

I didn't know about that. That's a fun one!

>
> > That said, I'm not a native English speaker.
>
> Neither am I.
>

I am. Consistency is nice but either reads fine to me. Only brought it up
as I didn't see many other usages so seemed out of place.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: Fix typos

2019-08-02 Thread Sehrope Sarkuni
On Thu, Aug 1, 2019 at 10:18 PM Tom Lane  wrote:

> It's British vs. American spelling.  For the most part, Postgres
> follows American spelling, but there's the odd Briticism here and
> there.


Thanks for the explanation. I thought that might be the case but didn't
find any other usages of "serialise" so was not sure.


>   I'm not sure whether it's worth trying to standardize.
> I think the most recent opinion on this was Munro's:
>
>
> https://www.postgresql.org/message-id/ca+hukgjz-pdmgwxroiwvn-aeg4-ajdwj3gwdqkosa8g65sp...@mail.gmail.com


Either reads fine to me and the best rationale I can think of for going
with one spelling is to not have the same "fix" come up again.

If there is a desire to change this, attached is updated to include one
more instance of "materialise" and a change to the commit message to match
some similar ones I found in the past.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
From 20c018b9350cb17e87930cebb66b715bc4b9fcf4 Mon Sep 17 00:00:00 2001
From: Sehrope Sarkuni 
Date: Fri, 2 Aug 2019 05:58:35 -0400
Subject: [PATCH] Use American spelling for "serialize" and "materalize"

---
 contrib/pgstattuple/pgstatapprox.c  | 2 +-
 contrib/xml2/xpath.c| 2 +-
 src/backend/access/transam/README   | 4 ++--
 src/backend/storage/buffer/bufmgr.c | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 636c8d40ac..83b3270554 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -278,7 +278,7 @@ pgstattuple_approx_internal(Oid relid, FunctionCallInfo fcinfo)
  errmsg("cannot access temporary tables of other sessions")));
 
 	/*
-	 * We support only ordinary relations and materialised views, because we
+	 * We support only ordinary relations and materialized views, because we
 	 * depend on the visibility map and free space map for our estimates about
 	 * unscanned pages.
 	 */
diff --git a/contrib/xml2/xpath.c b/contrib/xml2/xpath.c
index 1e5b71d9a0..11749b6c47 100644
--- a/contrib/xml2/xpath.c
+++ b/contrib/xml2/xpath.c
@@ -573,7 +573,7 @@ xpath_table(PG_FUNCTION_ARGS)
  errmsg("xpath_table must be called as a table function")));
 
 	/*
-	 * We want to materialise because it means that we don't have to carry
+	 * We want to materialize because it means that we don't have to carry
 	 * libxml2 parser state between invocations of this function
 	 */
 	if (!(rsinfo->allowedModes & SFRM_Materialize))
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index ad4083eb6b..49b6809c82 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -609,8 +609,8 @@ or more buffer locks be held concurrently, you must lock the pages in
 appropriate order, and not release the locks until all the changes are done.
 
 Note that we must only use PageSetLSN/PageGetLSN() when we know the action
-is serialised. Only Startup process may modify data blocks during recovery,
-so Startup process may execute PageGetLSN() without fear of serialisation
+is serialized. Only Startup process may modify data blocks during recovery,
+so Startup process may execute PageGetLSN() without fear of serialization
 problems. All other processes must only call PageSet/GetLSN when holding
 either an exclusive buffer lock or a shared lock plus buffer header lock,
 or be writing the data block directly rather than through shared buffers
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6f3a402854..3d9e0a3488 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3519,7 +3519,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 			/*
 			 * Set the page LSN if we wrote a backup block. We aren't supposed
 			 * to set this when only holding a share lock but as long as we
-			 * serialise it somehow we're OK. We choose to set LSN while
+			 * serialize it somehow we're OK. We choose to set LSN while
 			 * holding the buffer header lock, which causes any reader of an
 			 * LSN who holds only a share lock to also obtain a buffer header
 			 * lock before using PageGetLSN(), which is enforced in
-- 
2.17.1



Fix typos

2019-08-01 Thread Sehrope Sarkuni
Hi,

Attached fixes some typos for "serialise" => "serialize" and "materialise"
=> "materialize".

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
From 30d34d082120ac2c041a4ad45e9d6e31b0ea9c9d Mon Sep 17 00:00:00 2001
From: Sehrope Sarkuni 
Date: Thu, 1 Aug 2019 08:00:23 -0400
Subject: [PATCH] Fix typos

---
 contrib/xml2/xpath.c| 2 +-
 src/backend/access/transam/README   | 4 ++--
 src/backend/storage/buffer/bufmgr.c | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/contrib/xml2/xpath.c b/contrib/xml2/xpath.c
index 1e5b71d9a0..11749b6c47 100644
--- a/contrib/xml2/xpath.c
+++ b/contrib/xml2/xpath.c
@@ -573,7 +573,7 @@ xpath_table(PG_FUNCTION_ARGS)
  errmsg("xpath_table must be called as a table function")));
 
 	/*
-	 * We want to materialise because it means that we don't have to carry
+	 * We want to materialize because it means that we don't have to carry
 	 * libxml2 parser state between invocations of this function
 	 */
 	if (!(rsinfo->allowedModes & SFRM_Materialize))
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index ad4083eb6b..49b6809c82 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -609,8 +609,8 @@ or more buffer locks be held concurrently, you must lock the pages in
 appropriate order, and not release the locks until all the changes are done.
 
 Note that we must only use PageSetLSN/PageGetLSN() when we know the action
-is serialised. Only Startup process may modify data blocks during recovery,
-so Startup process may execute PageGetLSN() without fear of serialisation
+is serialized. Only Startup process may modify data blocks during recovery,
+so Startup process may execute PageGetLSN() without fear of serialization
 problems. All other processes must only call PageSet/GetLSN when holding
 either an exclusive buffer lock or a shared lock plus buffer header lock,
 or be writing the data block directly rather than through shared buffers
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6f3a402854..3d9e0a3488 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3519,7 +3519,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 			/*
 			 * Set the page LSN if we wrote a backup block. We aren't supposed
 			 * to set this when only holding a share lock but as long as we
-			 * serialise it somehow we're OK. We choose to set LSN while
+			 * serialize it somehow we're OK. We choose to set LSN while
 			 * holding the buffer header lock, which causes any reader of an
 			 * LSN who holds only a share lock to also obtain a buffer header
 			 * lock before using PageGetLSN(), which is enforced in
-- 
2.17.1



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-31 Thread Sehrope Sarkuni
On Wed, Jul 31, 2019 at 2:32 AM Masahiko Sawada 
wrote:

> Just to confirm, we have 21 bits left for nonce in CTR? We have LSN (8
> bytes), page-number (4 bytes) and counter (11 bits) in 16 bytes nonce
> space. Even though we have 21 bits left we cannot store relfilenode to
> the IV.
>

Fields like relfilenode, database, or tablespace could be added to the
derived key, not the per-page IV. There's no space limitations as they are
additional inputs into the HKDF (key derivation function).

Additional salt of any size, either some stored random value or something
deterministic like the relfilenode/database/tablespace, can be added to the
HKDF. There's separate issues with including those specific fields but it's
not a size limitation.


> For WAL encryption,  before flushing WAL we encrypt whole 8k WAL page
> and then write only the encrypted data of the new WAL record using
> pg_pwrite() rather than write whole encrypted page. So each time we
> encrypt 8k WAL page we end up with encrypting different data with the
> same key+nonce but since we don't write to the disk other than space
> where we actually wrote WAL records it's not a problem. Is that right?
>

Ah, this is what I was referring to in my previous mail. I'm not familiar
with how the writes happen yet (reading up...) but, yes, we would need to
ensure that encrypted data is not written more than once (i.e. no writing
of encrypt(zero) followed by writing of encrypt(non-zero) at the same spot).

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-31 Thread Sehrope Sarkuni
On Tue, Jul 30, 2019 at 4:48 PM Bruce Momjian  wrote:

> I had more time to think about the complexity of adding relfilenode to
> the IV.  Since relfilenode is only unique within a database/tablespace,
> we would need to have pg_upgrade preserve database/tablespace oids
> (which I assume are the same as the directory and tablespace symlinks).
> Then, to decode a page, you would need to look up those values.  This is
> in addition to the new complexity of CREATE DATABASE and moving files
> between tablespaces.  I am also concerned that crash recovery operations
> and cluster forensics and repair would need to also deal with this.
>
> I am not even clear if pg_upgrade preserving relfilenode is possible ---
> when we wrap the relfilenode counter, does it start at 1 or at the
> first-user-relation-oid?  If the former, it could conflict with oids
> assigned to new system tables in later major releases.  Tying the
> preservation of relations to two restrictions seems risky.
>

Agreed. Unless you know for sure the input is going to immutable across
copies or upgrades, including anything in either the IV or key derivation
gets risky and could tie you down for the future. That's partly why I like
the idea of a separate salt (basically you directly pay for the complexity by
tracking that).

Even if we do not include a separate per-relation salt or things like
relfilenode when generating a derived key, we can still include other types
of immutable attributes. For example the fork type could be included to
eventually allow multiple forks for the same relation to be encrypted with
the same IV = LSN + Page Number as the derived key per-fork would be
distinct.


> Using just the page LSN and page number allows a page to be be
> decrypted/encrypted independently of its file name, tablespace, and
> database, and I think that is a win for simplicity.  Of course, if it is
> insecure we will not do it.
>

As LSN + Page Number combo is unique for all relations (not just one
relation) I think we're good for pages.

> I am thinking for the heap/index IV, it would be:
>
> uint64 lsn;
> unint32 page number;
> /* only uses 11 bits for a zero-based CTR counter for 32k pages */
> uint32 counter;
>

Looks good.


> and for WAL it would be:
>
> uint64 segment_number;
> uint32    counter;
> /* guarantees this IV doesn't match any relation IV */
> uint32   2^32-1 /* all 1's */
>

I need to read up more on the structure of the WAL records but here are some
high level thoughts:

WAL encryption should not use the same key as page encryption so there's no
need to design the IV to try to avoid matching the page IVs. Even a basic
derivation with a single fixed WDEK = HKDF(MDEK, "WAL") and TDEK =
HKDF(MDEK, "PAGE") would ensure separate keys. That's the the literal
string "WAL" or "PAGE" being added as a salt to generate the respective
keys, all that matters is they're different.

Ideally WAL encryption would generate new derived keys as part of the WAL
stream. The WAL stream is not fixed so you have the luxury of being able to
add a "Use new random salt XZY going forward" record. Forcing generation
of a new salt/key upon promotion of a replica would ensure that at least
the WAL is unique going forward. Could also generate a new one upon server
startup, after every N bytes, or a new one for each new WAL file. There's
much more flexibility compared to page encryption.

As WAL is a single continuous stream, we can start the IV for each derived
WAL key from zero. There's no need to complicate it further as Key + IV
will never be reused.

If WAL is always written as full pages we need to ensure that the empty
parts of the page are actual zeros and not "encrypted zeroes". Otherwise an
XOR of the empty section of the first write of a page against a subsequent
one would give you the plain text.
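
(Spelled out: with key stream KS for that section, the first write stores
KS xor 0 = KS and a later write stores KS xor P; anyone holding both
images can compute KS xor (KS xor P) = P and recover the plain text
without the key.)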

The non-fixed size of the WAL allows for the addition of a MAC though I'm
not sure yet the best way to incorporate it. It could be part of each
encrypted record or its own summary record (providing a MAC for a series of
WAL records). After I've gone through this a bit more I'm looking to put
together a write up with this and some other thoughts in one place.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-30 Thread Sehrope Sarkuni
On Tue, Jul 30, 2019 at 10:06 AM Masahiko Sawada 
wrote:

> On Tue, Jul 30, 2019 at 5:03 AM Sehrope Sarkuni 
> wrote:
> >
> > On Mon, Jul 29, 2019 at 9:44 AM Bruce Momjian  wrote:
> >>
> >> > Checking that all buffers using a single LSN are from the same
> >> > relation would be a good idea but I think it's hard to test it and
> >> > regard the test result as okay. Even if we passed 'make checkworld',
> >> > it might still be possible to happen. And even assertion failures
> >>
> >> Yes, the problem is that if you embed the relfilenode or tablespace or
> >> database in the encryption IV, you then need to then make sure you
> >> re-encrypt any files that move between these.  I am hesitant to do that
> >> since it then requires these workarounds for encryption going forward.
> >> We know that most people will not be using encryption, so that will not
> >> be well tested either.  For pg_upgrade, I used a minimal-impact
> >> approach, and it has allowed dramatic changes in our code without
> >> requiring changes and retesting of pg_upgrade.
> >
> >
> > Will there be a per-relation salt stored in a separate file? I saw it
> mentioned in a few places (most recently
> https://www.postgresql.org/message-id/aa386c3f-fb89-60af-c7a3-9263a633ca1a%40postgresql.org)
> but there's also discussion of trying to make the TDEK unique without a
> separate salt so I'm unsure.
> >
> > With a per-relation salt there is no need to include fixed attributes
> (database, relfilenode, or tablespace) to ensure the derived key is unique
> per relation. A long salt (32-bytes from /dev/urandom) alone guarantees
> that uniqueness. Copying or moving files would then be possible by also
> copying the salt. It does not need to be a salt per file on disk either,
> one salt can be used for many files for the same relation by including the
> fork number, type, or segment in the TDEK derivation (so each file on disk
> for that relation ends up with a unique TDEK).
>
> If we can derive unique TDEK using (database, tablespace, relfilenode)
> as info I think it's better to use it rather than using random salt
> per relations since it doesn't require additional information we need
> to store. As described in HKDF RFC[1], if the input key is already
> present as a cryptographically strong key we can skip the extract part
> where use a salt.
>
> [1] https://tools.ietf.org/html/rfc5869


Yes, a random salt is not required for security reasons. Any unique values
mixed into the HKDF are fine and the derived keys will still be unique. The
HKDF ensures that uniqueness.

The separate salt allows you to disconnect the key derivation from the
physical attributes of the file. The physical attributes (ex: database,
tablespace, file node) are very convenient as they're unique and do not
require additional storage. However using them prevents copying or moving
the encrypted files as one or more of them would be different at the
destination (so the derived key would no longer decrypt the existing data).
So you would have to decrypt / encrypt everything as part of a copy.

If copying raw files without a decrypt/encrypt cycle is desired then the
key derivation cannot include physical attributes (or per Bruce's note
above, there would be no separate key derivation per relation). I thought it'd
be a nice property to have as it limits the amount of code that needs to be
crypto aware (ex: copying a database or moving a table to a different
tablespace would not change beyond ensuring the salt is also copied).

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-30 Thread Sehrope Sarkuni
On Tue, Jul 30, 2019 at 8:16 AM Bruce Momjian  wrote:

> On Tue, Jul 30, 2019 at 07:44:20AM -0400, Sehrope Sarkuni wrote:
> > If each relation file has its own derived key, the derived TDEK for that
> > relation file, then there is no issue with reusing the same IV = LSN ||
> Page
> > Number. The TDEKs will be different so Key + IV will never collide.
>
> So, this email explains that we are considering not using the
> relfilenode/tablespace/database to create a derived key per relation,
> but use the same key for all relations because the IV will be unique per
> page across all relations:
>
>
> https://www.postgresql.org/message-id/20190729134442.2bxakegiqafxg...@momjian.us
>
> There is talk of using a derived key with a constant to make sure all
> heap/index files use a different derived key than WAL, but I am not
> sure.  This is related to whether WAL IV and per-heap/index IV can
> collide.
>

Ah, I read that to imply that derived keys were a must. Specifically this
piece at the end:

From Joe's email on 2019-07-13 18:41:34:
>> Based on all of that I cannot find a requirement that we use more than
>> one key per database.
>>
>> But I did find that files in an encrypted file system are encrypted with
>> derived keys from a master key, and I view this as analogous to what we
>> are doing.

I read that as the "one key per database" is the MDEK.

And read the part about derived keys as referring to separate derived keys
for relations. Perhaps I misread and it was referring to different keys for
WAL vs pages.


> There are other emails in the thread that also discuss the topic.  The
> issue is that we add a lot of complexity to other parts of the system,
> (e.g. pg_upgrade, CREATE DATABASE, moving relations between tablespaces)
> to create a derived key, so we should make sure we need it before we do
> it.
>

Yes it definitely complicates things both on the derivation and potential
additional storage for the salts (they're small and fixed size, but you
still need to put it somewhere).

I think key rotation for TDEK will be impossible without some stored salt
and per-relation derived key. It might not be needed in a first cut though
as the "default salt" could be no salt or a place holder of all zeroes.
Even if the rotation itself is out of scope for a first pass the potential
to eventually add it should be there.

Should keep in mind that because we do not have a MAC on the encrypted
pages we'll need to know which derived key to use. We can't try multiple
options to see which is correct as any key would "succeed" with garbage
decrypted data.


> > In general it's fine to use the same IV with different keys. Only reuse
> of Key
> > + IV is a problem and the entire set of possible counter values (IV + 0,
> IV +
> > 1, ...) generated with a key must be unique. That's also why we must
> leave at
> > least log2(PAGE_SIZE / AES_BLOCK_SIZE) bits at the end of the IV to be
> filled
> > in with 0, 1, 2, ... for each 16-byte AES-block on the page. If our
> per-page IV
> > prefix used any of those bits then the counter could overflow into the
> next
> > page's IV's range.
>
> Agreed.
>
> Attached is an updated patch that checks only main relation forks, which
> I think are the only files we are going to encrypt, and it has better
> comments.
>

Okay that makes sense in the context of using a single key and relying on
the LSN based IV to be unique.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-30 Thread Sehrope Sarkuni
On Tue, Jul 30, 2019 at 8:16 AM Masahiko Sawada 
wrote:

> On Mon, Jul 29, 2019 at 8:18 PM Sehrope Sarkuni 
> wrote:
> >
> > On Mon, Jul 29, 2019 at 6:42 AM Masahiko Sawada 
> wrote:
> > > > An argument could be made to push that problem upstream, i.e. let the
> > > > supplier of the passphrase deal with the indirection. You would still
> > > > need to verify the supplied passphrase/key is correct via something
> > > > like authenticating against a stored MAC.
> > >
> > > So do we need the key for MAC of passphrase/key in order to verify?
> >
> > Yes. Any 128 or 256-bit value is a valid AES key and any 16-byte input
> > can be "decrypted" with it in both CTR and CBC mode, you'll just end
> > up with garbage data if the key does not match. Verification of the
> > key prior to usage (i.e. starting DB and encrypting/decrypting data)
> > is a must as otherwise you'll end up with all kinds of corruption or
> > data loss.
> >
>
> Do you mean that we encrypt and store a 16 byte input with the correct
> key to the disk, and then decrypt it with the user supplied key and
> compare the result to the input data?
>

Yes but we don't compare via decryption of a known input. We instead
compute a MAC of the encrypted master key using the user supplied key, and
compare that against an expected MAC stored alongside the encrypted master
key.

The pseudo code would be something like:

// Read key text from user:
string raw_kek = read_from_user()
// Normalize it to a fixed size of 64-bytes
byte[64] kek = SHA512(SHA512(raw_kek))
// Split the 64-bytes into a separate encryption and MAC key
byte[32] user_encryption_key = kek.slice(0,32)
byte[32] user_mac_key = kek.slice(32, 64)

// Read our saved MAC and encrypted master key
byte[80] mac_iv_encrypted_master_key = read_from_file()
// First 32-bytes is the MAC of the rest
byte[32] expected_mac = mac_iv_encrypted_master_key.slice(0, 32)
// Rest is a random IV + Encrypted master key
byte[48] iv_encrypted_master_key = mac_iv_encrypted_master_key.slice(32, 80)

// Compute the MAC with the user supplied key
byte[32] actual_mac = HMAC(user_mac_key, iv_encrypted_master_key)
// If it does not match then the user key is invalid
if (actual_mac != expected_mac) {
  print_err_and_exit("Bad user key!")
}

// Our MAC was correct
// ... so we know user supplied key is valid
// ... and we know our iv and encrypted_key are valid
byte[16] iv = iv_encrypted_master_key.slice(0,16)
byte[32] encrypted_master_key = iv_encrypted_master_key.slice(16, 48)
// ... so we can use all three to decrypt the master key (MDEK)
byte[32] master_key = decrypt_aes_cbc(user_encryption_key, iv,
encrypted_master_key)
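
The first few steps translate fairly directly to OpenSSL calls; a minimal
C sketch of the normalization and split (illustration only, function name
made up):

#include <string.h>
#include <openssl/sha.h>

/* Normalize the raw pass phrase to 64 bytes and split it into two keys. */
static void
derive_user_keys(const char *raw_kek, size_t raw_len,
                 unsigned char user_encryption_key[32],
                 unsigned char user_mac_key[32])
{
    unsigned char pass1[SHA512_DIGEST_LENGTH];
    unsigned char kek[SHA512_DIGEST_LENGTH];

    /* kek = SHA512(SHA512(raw_kek)), always 64 bytes */
    SHA512((const unsigned char *) raw_kek, raw_len, pass1);
    SHA512(pass1, sizeof(pass1), kek);

    memcpy(user_encryption_key, kek, 32);
    memcpy(user_mac_key, kek + 32, 32);
}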


> > From a single user supplied passphrase you would derive the MDEK and
> > compute a MAC (either using the same key or via a separate derived
> > MDEK-MAC key). If the computed MAC matches against the previously
> > stored value then you know the MDEK is correct as well.
>
> You meant KEK, not MDEK?
>

If the KEK is incorrect then the MAC validation would fail and the decrypt
would never be attempted.

If the MAC matches then both the KEK (user supplied key) and MDEK
("master_key" in the pseudo code above) would be confirmed to be valid. So
the MDEK is safe to use for deriving keys for encrypt / decrypt.

I'm using the definitions for "KEK" and "MDEK" from Joe's mail
https://www.postgresql.org/message-id/c878de71-a0c3-96b2-3e11-9ac2c35357c3%40joeconway.com

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-30 Thread Sehrope Sarkuni
On Mon, Jul 29, 2019 at 8:35 PM Bruce Momjian  wrote:

> On Sun, Jul 28, 2019 at 10:33:03PM -0400, Bruce Momjian wrote:
> > I am thinking of writing some Assert() code that checks that all buffers
> > using a single LSN are from the same relation (and therefore different
> > page numbers).  I would do it by creating a static array, clearing it on
> > XLogBeginInsert(), adding to it for each  XLogInsert(), then checking on
> > PageSetLSN() that everything in the array is from the same file.  Does
> > that make sense?
>
> So, I started looking at how to implement the Assert checks and found
> that Heikki has already added (in commit 2c03216d83) Assert checks to
> avoid duplicate block numbers in WAL.  I just added the attached patch
> to check that all RelFileNodes are the same.
>

From the patch:

/*
! * The initialization vector (IV) is used for page-level
! * encryption.  We use the LSN and page number as the IV, and IV
! * values must never be reused since it is insecure. It is safe
! * to use the LSN on multiple pages in the same relation since
! * the page number is part of the IV.  It is unsafe to reuse the
! * LSN in different relations because the page number might be
! * the same, and hence the IV.  Therefore, we check here that
! * we don't have WAL records for different relations using the
! * same LSN.
! */

If each relation file has its own derived key, the derived TDEK for that
relation file, then there is no issue with reusing the same IV = LSN ||
Page Number. The TDEKs will be different so Key + IV will never collide.

In general it's fine to use the same IV with different keys. Only reuse of
Key + IV is a problem and the entire set of possible counter values (IV +
0, IV + 1, ...) generated with a key must be unique. That's also why we
must leave at least log2(PAGE_SIZE / AES_BLOCK_SIZE) bits at the end of the
IV to be filled in with 0, 1, 2, ... for each 16-byte AES-block on the
page. If our per-page IV prefix used any of those bits then the counter
could overflow into the next page's IV's range.
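
As a concrete check of that constraint (a minimal Python sketch, not from
any patch), the number of low-order IV bits that must be reserved for the
in-page block counter follows directly from the page and AES block sizes:

import math

AES_BLOCK_SIZE = 16  # bytes per AES block

for page_size in (8192, 32768):
    blocks_per_page = page_size // AES_BLOCK_SIZE
    counter_bits = math.ceil(math.log2(blocks_per_page))
    print(f"{page_size}-byte pages: {blocks_per_page} blocks, "
          f"reserve {counter_bits} IV bits")

# Prints 9 bits for 8K pages and 11 bits for 32K pages; the per-page IV
# prefix must leave those bits zero so the block counter never spills
# into the next page's IV range.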

> I ran the regression tests with asserts on and got no failures, so I
> think we are good.
>

It's not strictly required, but it also doesn't hurt that the LSN is unique
per relation, so that's still good news!

Might be useful for something down the road like a separate stream of MACs
computed per-LSN.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Sehrope Sarkuni
On Mon, Jul 29, 2019 at 4:15 PM Alvaro Herrera 
wrote:

> On 2019-Jul-27, Sehrope Sarkuni wrote:
>
> > Given the non-cryptographic nature of CRC and its 16-bit size, I'd
> > round down the malicious tamper detection it provides to zero. At best
> > it catches random disk errors so might as well keep it in plain text
> > and checkable offline.
>
> But what attack are we protecting against?  We fear that somebody will
> steal a disk or a backup.  We don't fear that they will *write* data.
> The CRC is there to protect against data corruption.  So whether or not
> the CRC protects against malicious tampering is beside the point.
>

That was in response to using an encrypted CRC for tamper detection. I
agree that it does not provide meaningful protection so there is no point
in adding complexity to use it for that.

I agree it's better to leave the CRC as-is for detecting corruption which
also has the advantage of playing nice with existing checksum tooling.


> If we were trying to protect against an attacker having access to
> *writing* data in the production server, this encryption scheme is
> useless: they could just as well read unencrypted data from shared
> buffers anyway.
>

The attack situation is someone being able to modify pages at the storage
tier. They cannot necessarily read server memory or the encryption key, but
they could make changes to existing data or an existing backup that would
be subsequently read by the server.

Dealing with that is way out of scope, but, as with the replica promotion
issue, I think it should be tracked and documented.


> I think trying to protect against malicious data tampering is a second
> step *after* this one is done.
>

+1

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Sehrope Sarkuni
On Mon, Jul 29, 2019 at 4:10 PM Alvaro Herrera 
wrote:

> On 2019-Jul-27, Bruce Momjian wrote:
>
> > I think using LSN and page number, we will _never_ reuse the IV, except
> > for cases like promoting two standbys, which I think we have to document
> > as an insecure practice.
>
> Actually, why is it an insecure practice?  If you promote two standbys,
> then the encrypted pages are the same pages, so it's not two different
> messages with the same key/IV -- they're still *one* message.  And as
> soon as they start getting queries, they will most likely diverge
> because the LSNs of records after the promotion will (most likely) no
> longer match.  It takes one different WAL record length for the
> "encryption histories" to diverge completely ...
>

You could have a sequence of post promotion events like:

# Replica 1
LSN=t+0 Operation A
LSN=t+1 Operation B
...
LSN=t+n Operation C

# Replica 2
LSN=t+0 Operation X
LSN=t+1 Operation Y
...
LSN=t+n Operation Z

If the LSN and modified page numbers of C and Z are the same
... and the net effect of Z is known (ex: setting a bunch of bytes on the
row to zero)
... and you can read the encrypted pages of both replicas (ex: have access
to the encrypted storage tier but not necessarily the live server)
... then you can XOR the encrypted pages to get the plain text for the
bytes after operation C.

Yes, it's not likely and yes it has a lot of "if..." involved, but it is
possible.
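
A minimal sketch of that XOR recovery (Python; a random byte string stands
in for the AES-CTR keystream that both replicas would reuse given the same
key, LSN, and page number):

import os

keystream = os.urandom(64)   # same key + same IV => same CTR keystream

page_c = os.urandom(64)      # replica 1, operation C: secret contents
page_z = bytes(64)           # replica 2, operation Z: known contents (zeroed)

ct_c = bytes(p ^ k for p, k in zip(page_c, keystream))
ct_z = bytes(p ^ k for p, k in zip(page_z, keystream))

# XOR of the two ciphertexts cancels the keystream completely, and since
# operation Z's plaintext is known, operation C's plaintext falls out.
recovered = bytes(a ^ b for a, b in zip(ct_c, ct_z))
assert recovered == page_c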

I don't think this will be an issue in practice, but it should be
documented. Otherwise, it's not unreasonable for someone to expect that a
promoted replica would be using new keys for everything after each
promotion.

Encryption for WAL can avoid this type of problem entirely by generating a
new random salt and adding a "Use new salt XYZ for WDEK going forward"
record. The two replicas would generate different salts so all subsequent
encrypted WAL data would be different (even the exact same records).
Unfortunately, that doesn't work for pages without a lot more complexity to
keep track of which key version to use based upon the LSN.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Sehrope Sarkuni
On Mon, Jul 29, 2019 at 9:44 AM Bruce Momjian  wrote:

> > Checking that all buffers using a single LSN are from the same
> > relation would be a good idea but I think it's hard to test it and
> > regard the test result as okay. Even if we passed 'make checkworld',
> > it might still be possible to happen. And even assertion failures
>
> Yes, the problem is that if you embed the relfilenode or tablespace or
> database in the encryption IV, you then need to then make sure you
> re-encrypt any files that move between these.  I am hesitant to do that
> since it then requires these workarounds for encryption going forward.
> We know that most people will not be using encryption, so that will not
> be well tested either.  For pg_upgrade, I used a minimal-impact
> approach, and it has allowed dramatic changes in our code without
> requiring changes and retesting of pg_upgrade.
>

Will there be a per-relation salt stored in a separate file? I saw it
mentioned in a few places (most recently
https://www.postgresql.org/message-id/aa386c3f-fb89-60af-c7a3-9263a633ca1a%40postgresql.org)
but there's also discussion of trying to make the TDEK unique without a
separate salt so I'm unsure.

With a per-relation salt there is no need to include fixed attributes
(database, relfilenode, or tablespace) to ensure the derived key is unique
per relation. A long salt (32-bytes from /dev/urandom) alone guarantees
that uniqueness. Copying or moving files would then be possible by also
copying the salt. It does not need to be a salt per file on disk either,
one salt can be used for many files for the same relation by including the
fork number, type, or segment in the TDEK derivation (so each file on disk
for that relation ends up with a unique TDEK).
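
Here's a rough sketch of that derivation in Python using only the standard
hmac/hashlib modules; the salt handling and the info layout are assumptions
for illustration, not the actual proposal:

import hashlib
import hmac
import os

def hkdf_sha256(key, salt, info, length=32):
    # RFC 5869 extract-then-expand; one block is enough for a 32-byte key
    prk = hmac.new(salt, key, hashlib.sha256).digest()
    return hmac.new(prk, info + b"\x01", hashlib.sha256).digest()[:length]

mdek = os.urandom(32)            # master data encryption key
relation_salt = os.urandom(32)   # one salt per relation, stored out of band

# Fork and segment go into the info parameter, so every file on disk for
# the relation gets its own TDEK while sharing a single stored salt.
tdek_main_seg0 = hkdf_sha256(mdek, relation_salt, b"main/0")
tdek_main_seg1 = hkdf_sha256(mdek, relation_salt, b"main/1")
tdek_fsm_seg0 = hkdf_sha256(mdek, relation_salt, b"fsm/0")

assert len({tdek_main_seg0, tdek_main_seg1, tdek_fsm_seg0}) == 3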

There are the usual gotchas of copying encrypted data, i.e. the copies are
byte-for-byte identical so it's evident they hold the same contents. But any
subsequent changes would have a
different LSN and encrypt differently going forward. If the main use cases
are copying an entire database or moving a tablespace, having that be
simpler/faster seems like a good idea. It could be a known limitation like
the promoting of multiple replicas. Plus with a key rotation tool anyone
that wants everything re-encrypted could run one after the copy.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Sehrope Sarkuni
On Mon, Jul 29, 2019 at 6:42 AM Masahiko Sawada  wrote:
> > An argument could be made to push that problem upstream, i.e. let the
> > supplier of the passphrase deal with the indirection. You would still
> > need to verify the supplied passphrase/key is correct via something
> > like authenticating against a stored MAC.
>
> So do we need the key for MAC of passphrase/key in order to verify?

Yes. Any 128 or 256-bit value is a valid AES key, and any 16-byte input
can be "decrypted" with it in both CTR and CBC mode; you'll just end
up with garbage data if the key does not match. Verification of the
key prior to usage (i.e. starting DB and encrypting/decrypting data)
is a must as otherwise you'll end up with all kinds of corruption or
data loss.

From a single user supplied passphrase you would derive the MDEK and
compute a MAC (either using the same key or via a separate derived
MDEK-MAC key). If the computed MAC matches against the previously
stored value then you know the MDEK is correct as well.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Sehrope Sarkuni
On Mon, Jul 29, 2019 at 4:39 AM Masahiko Sawada  wrote:
> After more thoughts, I'm confused why we need to have MDEK. We can use
> KEK derived from passphrase and TDEK and WDEK that are derived from
> KEK. That way, we don't need store any key in database file. What is
> the advantage of 3-tier key architecture?

The separate MDEK serves a couple purposes:

1. Allows for rotating the passphrase without actually changing any of
the downstream derived keys.
2. Verification that the passphrase itself is correct by checking if
it can unlock and authenticate (via a MAC) the MDEK.
3. Ensures it's generated from a strong random source (ex: /dev/urandom).

If the MDEK was directly derived via a deterministic function of the
passphrase, then that passphrase could never change, as all your
derived keys would also change (and thus could no longer decrypt their
existing data). The encrypted MDEK provides a level of indirection for
passphrase rotation.
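
A rough sketch of that indirection (Python; the KDF parameters are
illustrative, and encrypt_mdek is a placeholder for whatever authenticated
wrapping, e.g. AES plus a MAC, is actually used):

import hashlib
import os

def kek_from_passphrase(passphrase, salt):
    # KEK derived from the user passphrase; iteration count is illustrative
    return hashlib.pbkdf2_hmac("sha256", passphrase, salt, 100_000, dklen=32)

mdek = os.urandom(32)        # generated once from a strong random source
kdf_salt = os.urandom(16)

old_kek = kek_from_passphrase(b"old passphrase", kdf_salt)
new_kek = kek_from_passphrase(b"new passphrase", kdf_salt)

# Passphrase rotation re-wraps only the stored copy of the MDEK:
#   stored = encrypt_mdek(old_kek, mdek)  -->  stored = encrypt_mdek(new_kek, mdek)
# The MDEK itself, and every TDEK/WDEK derived from it, is unchanged, so no
# table or WAL data needs to be re-encrypted.
assert old_kek != new_kek
assert len(mdek) == 32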

An argument could be made to push that problem upstream, i.e. let the
supplier of the passphrase deal with the indirection. You would still
need to verify the supplied passphrase/key is correct via something
like authenticating against a stored MAC. If you're going to do that,
might as well directly support decrypting and managing your own MDEK.
That also lets you ensure it was properly generated via a strong random
source.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-27 Thread Sehrope Sarkuni
On Sat, Jul 27, 2019 at 1:32 PM Bruce Momjian  wrote:
> Uh, I listed the three options for the CRC and gave the benefits of
> each:
>
> 
> https://www.postgresql.org/message-id/20190725200343.xo4dcjm5azrfn...@momjian.us
>
> Obviously I was not clear on the benefits.  To quote:
>
> 1.  compute CRC and then encrypt everything
> 3.  encrypt and then CRC, and store the CRC encrypted
>
> Numbers 1 & 3 give us tampering detection, though with the CRC being so
> small, it isn't totally secure.
>
> > Are you worried about an attacker forging the page checksum by
> > installing another encrypted page that gives the same checksum?  I'm not
> > sure how that attack works ... I mean why can the attacker install
> > arbitrary pages?
>
> Well, with #2
>
> 2   encrypt and then CRC, and store the CRC unchanged
>
> you can modify the page, even small parts, and just replace the CRC to
> match your changes.  In #1 and #3, you would get a CRC error in almost
> all cases since you have no way of setting the decrypted CRC without
> knowing the key.  You can change the encrypted CRC, but the odds that
> the decrypted one would match the page is very slim.

Regarding #1 and #3, with CTR mode you do not need to know the key to
make changes to the CRC. Flipping bits of the encrypted CRC would flip
the same bits of the decrypted one. This was one of the issues with
the older WiFi encryption standard WEP[1] which used RC4 + CRC32. It's
not the exact same usage pattern, but I wouldn't be surprised if there
is a way to make in-place updates and matching CRC32 changes even if
it's encrypted.
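
A small demonstration of why (Python, using zlib's CRC-32 as a stand-in for
the 16-bit page checksum; the point is only that a CRC is linear over XOR,
so a known plaintext change implies a computable checksum change with no key
involved):

import os
import zlib

page = os.urandom(256)        # original plaintext bytes
delta = bytearray(256)        # attacker's intended plaintext change
delta[100:108] = b"tampered"

tampered = bytes(p ^ d for p, d in zip(page, delta))

# CRC is linear over XOR: crc(a ^ b) == crc(a) ^ crc(b) ^ crc(zeros).
# So the new checksum is computable from the old one and the delta alone.
predicted = zlib.crc32(page) ^ zlib.crc32(bytes(delta)) ^ zlib.crc32(bytes(256))
assert predicted == zlib.crc32(tampered)

# Under CTR the attacker XORs `delta` into the ciphertext and the matching
# CRC delta into the (encrypted) checksum bytes, and the page still verifies.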

Given the non-cryptographic nature of CRC and its 16-bit size, I'd
round down the malicious tamper detection it provides to zero. At best
it catches random disk errors so might as well keep it in plain text
and checkable offline.

More generally, without a cryptographic MAC I don't think it's
possible to provide any meaningful malicious tamper detection. And
even that would have to be off-page to deal with page replay (which I
think is out of scope).

[1]: https://en.wikipedia.org/wiki/CRC-32#Data_integrity

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/




Re: fsync error handling in pg_receivewal, pg_recvlogical

2019-07-27 Thread Sehrope Sarkuni
While reviewing this patch I read through some of the other fsync
callsites and noticed this typo (walkdir is in file_utils.c, not
initdb.c) too:

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 315c74c745..9b79df2d7f 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3208,7 +3208,7 @@ SyncDataDirectory(void)
  *
  * Errors are reported at level elevel, which might be ERROR or less.
  *
- * See also walkdir in initdb.c, which is a frontend version of this logic.
+ * See also walkdir in file_utils.c, which is a frontend version of this logic.
  */
 static void
 walkdir(const char *path,

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/




Re: fsync error handling in pg_receivewal, pg_recvlogical

2019-07-27 Thread Sehrope Sarkuni
Hi,

Tried out this patch and it applies, compiles, and passes check-world. Also
flipped things around in pg_recvlogical.c to exit-on-success to ensure it's
actually being called and that worked too. Outside of a more complicated
harness that simulates fsync errors not sure how else to test this further.

I did some searching and found a FUSE-based filesystem that looks
interesting: CharybdeFS[1]. Rather than being fixed at mount time, it has a
client/server interface so you can change the handling of syscalls on the
fly[2]. For example you can error out fsync calls halfway through a test
rather than always or randomly. Haven't tried it out but leaving it here as
it seems relevant.

[1]: https://github.com/scylladb/charybdefs
[2]:
https://www.scylladb.com/2016/05/02/fault-injection-filesystem-cookbook/

On Wed, Jun 26, 2019 at 12:11 AM Michael Paquier 
wrote:

> Why using a different error code.  Using EXIT_FAILURE is a more common
> practice in the in-core binaries.  The patch looks fine to me except
> that, that's a good first cut.
>

An error code specific to fsync issues could help with tests as the harness
could check it to ensure things died for the right reasons. With a generic
"messed up fsync" harness you might even be able to run some existing tests
that would otherwise pass and check for the fsync-specific exit code.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Sehrope Sarkuni
On Thu, Jul 25, 2019 at 8:50 PM Bruce Momjian  wrote:

> On Thu, Jul 25, 2019 at 08:44:40PM -0400, Sehrope Sarkuni wrote:
> > You can still use CTR mode and include those to make the key + IV unique
> by
> > adding them to the derived key rather than the IV.
> >
> > The IV per-page would still be LSN + page-number (with the block number
> added
> > as it's evaluated across the page) and the relfilenode, heap/index,
> database,
> > and anything else to make it unique can be included in the HKDF to
> create the
> > per-file derived key.
>
> I thought if we didn't have to hash the stuff together we would be less
> likely to get collisions with the IV.
>

IV creation does not use any hashing and would never have collisions with
the same key, as it's LSN + page + block (concatenation).

The derived keys would also not have collisions as the HKDF prevents that.
Deriving two matching keys with different inputs has the same chance as
randomly generating matching HMACs (effectively nil with something like
HMAC-SHA-256).

So there wouldn't be any reuse of the same key + IV. Even if two different
files are encrypted with the same LSN + page the total operation (key + IV)
would be different as they'd be using different derived keys.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Sehrope Sarkuni
On Thu, Jul 25, 2019 at 7:51 PM Bruce Momjian  wrote:

> Looking at the bits we have, the IV for AES is 16 bytes.  Since we know
> we have to use LSN (to change the IV for each page write), and the page
> number (so WAL updates that change multiple pages with the same LSN use
> different IVs), that uses 12 bytes:
>
> LSN 8 bytes
> page-number 4 bytes
>
> That leaves 4 bytes unused.  If we use CTR, we need 11 bits for the
> counter to support 32k pages sizes (per Sehrope Sarkuni), and we can use
> the remaining 5 bits as constants to indicate heap, index, or WAL.
> (Technically, since we are not encrypting the first 16 bytes, we could
> use one less bit for the counter.)  If we also use relfilenode, that is
> 4 bytes, so we have no bits for the heap/index/WAL constant, and no
> space for the CTR counter, meaning we would have to use CBC mode.
>

You can still use CTR mode and include those to make the key + IV unique by
adding them to the derived key rather than the IV.

The IV per-page would still be LSN + page-number (with the block number
added as it's evaluated across the page) and the relfilenode, heap/index,
database, and anything else to make it unique can be included in the HKDF
to create the per-file derived key.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-20 Thread Sehrope Sarkuni
On Sat, Jul 20, 2019 at 1:30 PM Tomas Vondra
 wrote:
> Forbid checksums? I don't see how that could be acceptable. We either have
> to accept the limitations of the current design (having to decrypt
> everything before checking the checksums) or change the design.
>
> I personally think we should do the latter - not just because of this
> "decrypt-then-verify" issue, but consider how much work we've done to
> allow enabling checksums on-line (it's still not there, but it's likely
> doable in PG13). How are we going to do that with encryption? ISTM we
> should design it so that we can enable encryption on-line too - maybe not
> in v1, but it should be possible. So how are we going to do that? With
> checksums it's (fairly) easy because we can "not verify" the page before
> we know all pages have a checksum, but with encryption that's not
> possible. And if the whole page is encrypted, then what?
>
> Of course, maybe we don't need such capability for the use-case we're
> trying to solve with encryption. I can imagine that someone is running a
> large system, has issues with data corruption, and decides to enable
> checksums to remedy that. Maybe there's no such scenario in the privacy
> case? But we can probably come up with scenarios where a new company
> policy forces people to enable encryption on all systems, or something
> like that.
>
> That being said, I don't know how to solve this, but it seems to me that
> any system where we can't easily decide whether a page is encrypted or not
> (because everything including the page header) is encrypted has this
> exact issue. Maybe we could keep some part of the header unencrypted
> (likely an information leak, does not solve decrypt-then-verify). Or maybe
> we need to store some additional information on each page (which breaks
> on-disk format).

How about storing the CRC of the encrypted pages? It would not leak
any additional information and serves the same purpose as a
non-encrypted one, namely I/O corruption detection. I took a look at
pg_checksum and besides checking for empty pages, the checksum
validation path does not interpret any other fields to calculate the
checksum. I think even the offline checksum enabling path looks like
it may work out of the box. Checksums of encrypted data are not a
replacement for a MAC and this would allow that validation to run
without any knowledge of keys.

Related, I think CTR mode should be considered for pages. It has
performance advantages at 8K data sizes, but even better, allows for
arbitrary bytes of the cipher text to be replaced. For example, after
encrypting a block you can replace the two checksum bytes with the CRC
of the cipher text v.s. CBC mode where that would cause corruption to
cascade forward. Same could be used for leaving things like
pd_pagesize_version in plaintext at its current offset. For anything
deemed non-sensitive, having it readable without having to decrypt the
page is useful.

It does not have to be full bytes either. CTR mode operates as a
stream of bits so it's possible to replace nibbles or even individual
bits. It can be something as small as one bit for an "is_encrypted" flag
or a handful of bits used to infer a derived key. For example, with
2-bits you could have 00 represent unencrypted, 01/10 represent
old/new key, and 11 be future use. Something like that could
facilitate online key rotation.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-15 Thread Sehrope Sarkuni
Hi,

Some more thoughts on CBC vs CTR modes. There are a number of
advantages to using CTR mode for page encryption.

CTR encryption modes can be fully parallelized, whereas CBC can only
be parallelized for decryption. While both can use AES-specific hardware
such as AES-NI, CTR modes can go a step further and use vectorized
instructions.

On an i7-8559U (with AES-NI) I get a 4x speed improvement for
CTR-based modes vs CBC when run on 8K of data:

# openssl speed -evp ${cipher}
type          16 bytes      64 bytes      256 bytes     1024 bytes    8192 bytes    16384 bytes
aes-128-cbc   1024361.51k   1521249.60k   1562033.41k   1571663.87k   1574537.90k   1575512.75k
aes-128-ctr    696866.85k   2214441.86k   4364903.85k   5896221.35k   6559735.81k   6619594.75k
aes-128-gcm    642758.92k   1638619.09k   3212068.27k   5085193.22k   6366035.97k   6474006.53k
aes-256-cbc    940906.25k   1114628.44k   1131255.13k   1138385.92k   1140258.13k   1143592.28k
aes-256-ctr    582161.82k   1896409.32k   3216926.12k   4249708.20k   4680299.86k   4706375.00k
aes-256-gcm    553513.89k   1532556.16k   2705510.57k   3931744.94k   4615812.44k   4673093.63k

For relation data where the encryption is going to be per page,
there's flexibility in how the CTR nonce (IV + counter) is generated.
With an 8K page, the counter need only go up to 512 for each page
(8192-bytes per page / 16-bytes per AES-block). That would require
9-bits for the counter. Rounding that up to 16-bits allows for wider
pages and it still uses only two bytes of the counter while ensuring
that it'd be unique per AES block. The remaining 14 bytes would be
populated with some other data that is guaranteed unique per
page-write to allow encryption via the same per-relation-file derived
key. From what I gather, the LSN is a candidate though it'd have to be
stored in plaintext for decryption.
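
One possible packing of such a nonce (a Python sketch; the exact field
layout is illustrative, not what's being proposed): an 8-byte LSN plus a
4-byte page number as the per-page-write prefix, two spare bytes, and a
16-bit block counter in the low-order position.

import struct

AES_BLOCK = 16

def ctr_block(lsn, page_no, block_in_page):
    # 16-byte counter block: 14-byte prefix unique per page write plus a
    # 16-bit counter that steps once per AES block within the page.
    assert block_in_page < 1 << 16
    return struct.pack(">QLHH", lsn, page_no, 0, block_in_page)

# Every 16-byte AES block of an 8K page gets a distinct counter value.
blocks = {ctr_block(0x01_0000_0000, 7, b) for b in range(8192 // AES_BLOCK)}
assert len(blocks) == 8192 // AES_BLOCK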

What's important is that no two page writes (whether to different
locations or the same page written back again) ever reuse the same nonce
with the same key. Using the same nonce with a different key is fine.

With any of these schemes the same inputs will generate the same
outputs. With CTR mode for WAL this would be an issue if the same key
and deterministic nonce (ex: LSN + offset) is reused in multiple
places. That does not have to be the same cluster either. For example
if two replicas are promoted from the same backup with the same master
key, they would generate the same WAL CTR stream, reusing the
key/nonce pair. Ditto for starting off with a master key and deriving
per-relation keys in a cloned installation off some deterministic
attribute such as oid.

This can be avoided by deriving new keys per file (not just per
relation) from a random salt. It'd be stored out of band and combined
with the master key to derive the specific key used for that CTR
stream. If there's a desire for supporting multiple ciphers or key
sizes, that could be stored alongside the salt. Perhaps use the same
location or lack of it to indicate "not encrypted" as well.

Per-file salts and derived keys would facilitate re-keying a table
piecemeal, file by file, by generating a new salt/derived-key,
encrypting a copy of the decrypted file, and doing an atomic rename.
The file's contents would change but its length and any references to
pages or byte offsets would stay valid. (I think this would work for
CBC modes too as there's nothing CTR specific about it.)

What I'm not sure of is how to handle randomizing the relation file IV in a
cloned database. Until the key for a relation file or segment is
rotated it'd have the same deterministic IV generated as its source as
the LSN would continue from the same point. One idea is with 128-bits
for the IV, one could have 64-bits for LSN, 16-bits for AES-block
counter, and the remaining 48-bits be randomized; though you'd need to
store those 48-bits somewhere per-page (basically it's a salt per
page). That'd give some protection from the clone's new data being
encrypted with the same stream as the parent's. Another option would
be to track ranges of LSNs and have a centralized list of 48-bit
randomized salts. That would remove the need for additional salt per
page though you'd have to do a lookup on that shared list to figure
out which to use.

CTR mode is definitely more complicated than a pure random-IV + CBC
but with any deterministic generation of IVs for CBC mode you're going
to have some of these same problems as well.

Regarding CRCs, CTR mode has the advantage of not destroying the rest
of the stream to replace the CRC bytes. With CBC mode any change would
cascade and corrupt the rest of data the down stream from that block.
With CTR mode you can overwrite the CRC's location with the CRC or a
truncated MAC of the encrypted data as each byte is encrypted
separately. At decryption time you simply ignore the decrypted output
of those bytes and zero them out again. A CRC of encrypted data (but
not a partial MAC) could be checked offline without access to the key.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-12 Thread Sehrope Sarkuni
On Thu, Jul 11, 2019 at 9:05 PM Bruce Momjian  wrote:
>
> On Thu, Jul 11, 2019 at 08:41:52PM -0400, Joe Conway wrote:
> > I vote for AES 256 rather than 128.
>
> Why?  This page seems to think 128 is sufficient:
>
> 
> https://crypto.stackexchange.com/questions/20/what-are-the-practical-differences-between-256-bit-192-bit-and-128-bit-aes-enc
>
> For practical purposes, 128-bit keys are sufficient to ensure 
> security.
> The larger key sizes exist mostly to satisfy some US military
> regulations which call for the existence of several distinct "security
> levels", regardless of whether breaking the lowest level is already 
> far
> beyond existing technology.
>
> We might need to run some benchmarks to determine the overhead of going
> to AES256, because I am unclear of the security value.

If the algorithm and key size is not going to be configurable then
better to lean toward the larger size, especially given the desire for
future proofing against standards evolution and potential for the
encrypted data to be very long lived. NIST recommends AES-128 or
higher but there are other publications that recommend AES-256 for
long term usage:

NIST - 2019 : Recommends AES-128, AES-192, or AES-256.
https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-131Ar2.pdf

NSA - 2016 : Recommends AES-256 for future quantum resistance.
https://apps.nsa.gov/iaarchive/library/ia-guidance/ia-solutions-for-classified/algorithm-guidance/cnsa-suite-and-quantum-computing-faq.cfm

ECRYPT - 2015 - Recommends AES-256 for future quantum resistance.
https://www.ecrypt.eu.org/csa/documents/PQC-whitepaper.pdf

ECRYPT - 2018 - Recommends AES-256 for long term use.
https://www.ecrypt.eu.org/csa/documents/D5.4-FinalAlgKeySizeProt.pdf

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/




Refactoring syslogger piping to simplify adding new log destinations

2019-07-10 Thread Sehrope Sarkuni
Hi,

While working on adding a new log_destination I noticed that the
syslogger piping would need to be updated. At the moment both ends
only handle stderr/csvlog as the pipe message header has a char
"is_last" that is either t/f (stderr last, stderr partial) or T/F
(csvlog last, csvlog partial). Couple approaches came to mind:

1. Use additional pairs of chars for each additional destination (e.g.
x/X, y/Y, ...) and mimic the logic of csvlog.
2. Repurpose the char "is_last" as a bitmap of the log destination
with the highest order bit indicating whether it's the last chunk.
3. Add a separate field "dest" for the log destination and leave
"is_last" as a t/f indicating whether it's the last chunk.

Attached are patches for each approach (fun exercise!). Also attached
is a basic TAP test to invoke the csvlog destination. It's a clone of
pg_ctl log rotation test that looks for .csv logs. If there's interest
in the test I was thinking of expanding it a bit to include "big"
output that would span multiple messages to test the partial/combining
path.

My thoughts on the approaches:

#1 doesn't change the header types or size but seems ugly as it leads
to new pairs of constants and logic in multiple places. In particular,
both send and receive ends have to encode and decode the destination.
#2 is cleaner as there's a logical separation of the dest fields and
no need for new constant pairs when adding new destinations (see the
sketch after this list). Would also need to ensure new
LOG_DESTINATION_xyz constants do not use that last bit (there are
already four, so room for three more).
#3 leads to the cleanest code though you lose 4 bytes of max data size
per chunk.
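
A quick sketch of the #2 encoding (Python, just to show the bit layout; the
flag values are illustrative rather than the actual LOG_DESTINATION_*
constants):

LOG_DESTINATION_STDERR = 0x01   # illustrative values only
LOG_DESTINATION_CSVLOG = 0x02
PIPE_CHUNK_IS_LAST = 0x80       # high-order bit of the old "is_last" byte

def encode_flags(dest, is_last):
    return dest | (PIPE_CHUNK_IS_LAST if is_last else 0)

def decode_flags(flags):
    return flags & ~PIPE_CHUNK_IS_LAST & 0xFF, bool(flags & PIPE_CHUNK_IS_LAST)

assert decode_flags(encode_flags(LOG_DESTINATION_CSVLOG, True)) == (LOG_DESTINATION_CSVLOG, True)
assert decode_flags(encode_flags(LOG_DESTINATION_STDERR, False)) == (LOG_DESTINATION_STDERR, False)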

Which would be preferable? I'd like to validate the approach as the
new log destination would be built atop it. I'm leaning toward #3, though
if the 4-byte loss is a deal breaker then #2.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
From 9ec38a587b0c2645bc9fd73398c1debdf9fa962b Mon Sep 17 00:00:00 2001
From: Sehrope Sarkuni 
Date: Wed, 10 Jul 2019 09:27:44 -0400
Subject: [PATCH 1/2] Add basic test for csvlog

---
 src/bin/pg_ctl/t/005_csvlog.pl | 94 ++
 1 file changed, 94 insertions(+)
 create mode 100644 src/bin/pg_ctl/t/005_csvlog.pl

diff --git a/src/bin/pg_ctl/t/005_csvlog.pl b/src/bin/pg_ctl/t/005_csvlog.pl
new file mode 100644
index 00..b74c373bb6
--- /dev/null
+++ b/src/bin/pg_ctl/t/005_csvlog.pl
@@ -0,0 +1,94 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 4;
+use Time::HiRes qw(usleep);
+
+# Set up node with logging collector
+my $node = get_new_node('primary');
+$node->init();
+$node->append_conf(
+	'postgresql.conf', qq(
+logging_collector = on
+lc_messages = 'C'
+log_destination = 'csvlog'
+));
+
+$node->start();
+
+# Verify that log output gets to the file
+
+$node->psql('postgres', 'SELECT 1/0');
+
+my $current_logfiles = slurp_file($node->data_dir . '/current_logfiles');
+
+note "current_logfiles = $current_logfiles";
+
+like(
+	$current_logfiles,
+	qr|^csvlog log/postgresql-.*csv$|,
+	'current_logfiles is sane');
+
+my $lfname = $current_logfiles;
+$lfname =~ s/^csvlog //;
+chomp $lfname;
+
+# might need to retry if logging collector process is slow...
+my $max_attempts = 180 * 10;
+
+my $first_logfile;
+for (my $attempts = 0; $attempts < $max_attempts; $attempts++)
+{
+	$first_logfile = slurp_file($node->data_dir . '/' . $lfname);
+	last if $first_logfile =~ m/division by zero/;
+	usleep(100_000);
+}
+
+like($first_logfile, qr/division by zero/, 'found expected log file content');
+
+# Sleep 2 seconds and ask for log rotation; this should result in
+# output into a different log file name.
+sleep(2);
+$node->logrotate();
+
+# pg_ctl logrotate doesn't wait for rotation request to be completed.
+# Allow a bit of time for it to happen.
+my $new_current_logfiles;
+for (my $attempts = 0; $attempts < $max_attempts; $attempts++)
+{
+	$new_current_logfiles = slurp_file($node->data_dir . '/current_logfiles');
+	last if $new_current_logfiles ne $current_logfiles;
+	usleep(100_000);
+}
+
+note "now current_logfiles = $new_current_logfiles";
+
+like(
+	$new_current_logfiles,
+	qr|^csvlog log/postgresql-.*csv$|,
+	'new current_logfiles is sane');
+
+$lfname = $new_current_logfiles;
+$lfname =~ s/^csvlog //;
+chomp $lfname;
+
+# Verify that log output gets to this file, too
+
+$node->psql('postgres', 'fee fi fo fum');
+
+my $second_logfile;
+for (my $attempts = 0; $attempts < $max_attempts; $attempts++)
+{
+	$second_logfile = slurp_file($node->data_dir . '/' . $lfname);
+	last if $second_logfile =~ m/syntax error/;
+	usleep(100_000);
+}
+
+like(
+	$second_logfile,
+	qr/syntax error/,
+	'found expected log file content in new log file');
+
+$node->stop();
-- 
2.17.1

From 7b9d827e5059445945d214388371138f0676b306 Mon Sep 17 00:00:00 2001
From: Sehro