Re: [PATCH] Simple progress reporting for COPY command

2021-01-08 Thread Bharath Rupireddy
On Fri, Jan 8, 2021 at 7:00 PM Matthias van de Meent
 wrote:
> On Thu, 7 Jan 2021 at 23:00, Josef Šimánek  wrote:
> >
> > čt 7. 1. 2021 v 22:37 odesílatel Tomas Vondra
> >  napsal:
> > >
> > > I'm not particularly attached to the "lines" naming, it just seemed OK
> > > to me. So if there's consensus to rename this somehow, I'm OK with it.
> >
> > The problem I do see here is it depends on the "way" of COPY. If
> > you're copying from CSV file to table, those are actually lines (since
> > 1 line = 1 tuple). But copying from DB to file is copying tuples (but
> > 1 tuple = 1 file line). Line works better here for me personally.
> >
> > Once I'll fix the problem with triggers (and also another cases if
> > found), I think we can consider it lines. It will represent amount of
> > lines processed from file on COPY FROM and amount of lines written to
> > file in COPY TO form (at least in CSV format). I'm not sure how BINARY
> > format works, I'll check.
>
> Counterexample that 1 tuple need not be 1 line, in csv/binary:

Yes in binary format file, 1 tuple is definitely not 1 line.

> /*
>  * create a table with one tuple containing 1 text field, which consists of
>  * 10 newline characters.
>  * If you want windows-style lines, replace '\x0A' (\n) with '\x0D0A' (\r\n).
>  */
> # CREATE TABLE ttab (val) AS
>   SELECT * FROM (values (
> repeat(convert_from(E'\x0A'::bytea, 'UTF8'), 10)::text
>   )) as v;
>
> # -- indeed, one unix-style line, according to $ wc -l copy.txt
> # COPY ttab TO 'copy.txt' (format text);
> COPY 1
> # TRUNCATE ttab; COPY ttab FROM 'copy.txt' (format text);
> COPY 1
>
> # -- 11 lines
> # COPY ttab TO 'copy.csv' (format csv);
> COPY 1
> # TRUNCATE ttab; COPY ttab FROM 'copy.csv' (format csv);
> COPY 1
>
> # -- 13 lines
> # COPY ttab TO 'copy.bin' (format binary);
> COPY 1
> # TRUNCATE ttab; COPY ttab FROM 'copy.bin' (format binary);
> COPY 1
>
> All of the above copy statements would only report 'lines_processed = 1',
> in the progress reporting, while csv/binary line counts are definatively
> inconsistent with what the progress reporting shows, because progress
> reporting counts tuples / table rows, not the amount of lines in the
> external file.

I agree with that. +1 to change the name of 'lines_processed' column
to 'tuples_processed'. So, the meaning of 'tuples_processed' can be -
for COPY FROM, it's the number of tuples that are written to the
target table, for COPY TO, it's the number of tuples from the source
table or source plan that are written out to the file/stdin.

IMHO, it's good to have at least the following info: the type of
command COPY TO or COPY FROM, the format CSV,BINARY,TEXT and from
where the input comes from or output goes to - FILE, STDIN, STDOUT or
PROGRAM. Otherwise users will have to join pg_stat_progress_copy view
with other views to get the type of copy command or the entire query.
That's sometimes cumbersome to figure out what's the other view name
and write join queries.

Another point related to documentation, I see that the
pg_stat_progress_copy view details are added to monitoring.sgml, but I
didn't find any mention of it in the copy.sgml. For instance, we have
pg_stat_progress_basebackup documentation both in monitoring.sgml and
a mention of it and link to it in pg_basebackup.sgml. Usually, users
will look at copy documentation to find any kind of information
related to it, so better we have it mentioned there as well.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Simple progress reporting for COPY command

2021-01-08 Thread Matthias van de Meent
On Thu, 7 Jan 2021 at 23:00, Josef Šimánek  wrote:
>
> čt 7. 1. 2021 v 22:37 odesílatel Tomas Vondra
>  napsal:
> >
> > I'm not particularly attached to the "lines" naming, it just seemed OK
> > to me. So if there's consensus to rename this somehow, I'm OK with it.
>
> The problem I do see here is it depends on the "way" of COPY. If
> you're copying from CSV file to table, those are actually lines (since
> 1 line = 1 tuple). But copying from DB to file is copying tuples (but
> 1 tuple = 1 file line). Line works better here for me personally.
>
> Once I'll fix the problem with triggers (and also another cases if
> found), I think we can consider it lines. It will represent amount of
> lines processed from file on COPY FROM and amount of lines written to
> file in COPY TO form (at least in CSV format). I'm not sure how BINARY
> format works, I'll check.

Counterexample that 1 tuple need not be 1 line, in csv/binary:

/*
 * create a table with one tuple containing 1 text field, which consists of
 * 10 newline characters.
 * If you want windows-style lines, replace '\x0A' (\n) with '\x0D0A' (\r\n).
 */
# CREATE TABLE ttab (val) AS
  SELECT * FROM (values (
repeat(convert_from(E'\x0A'::bytea, 'UTF8'), 10)::text
  )) as v;

# -- indeed, one unix-style line, according to $ wc -l copy.txt
# COPY ttab TO 'copy.txt' (format text);
COPY 1
# TRUNCATE ttab; COPY ttab FROM 'copy.txt' (format text);
COPY 1

# -- 11 lines
# COPY ttab TO 'copy.csv' (format csv);
COPY 1
# TRUNCATE ttab; COPY ttab FROM 'copy.csv' (format csv);
COPY 1

# -- 13 lines
# COPY ttab TO 'copy.bin' (format binary);
COPY 1
# TRUNCATE ttab; COPY ttab FROM 'copy.bin' (format binary);
COPY 1

All of the above copy statements would only report 'lines_processed = 1',
in the progress reporting, while csv/binary line counts are definatively
inconsistent with what the progress reporting shows, because progress
reporting counts tuples / table rows, not the amount of lines in the
external file.




Re: [PATCH] Simple progress reporting for COPY command

2021-01-07 Thread Amit Kapila
On Fri, Jan 8, 2021 at 9:45 AM Josef Šimánek  wrote:
>
> pá 8. 1. 2021 v 5:03 odesílatel Amit Kapila  napsal:
> >
> > On Fri, Jan 8, 2021 at 8:42 AM Josef Šimánek  
> > wrote:
> > >
> > > pá 8. 1. 2021 v 3:55 odesílatel Amit Kapila  
> > > napsal:
> > > >
> > > >
> > > > Can't we display the entire COPY command? I checked that
> > > > pg_stat_statements display the query so there shouldn't be a problem
> > > > to display the entire command.
> > >
> > > In previous discussions there was mentioned it doesn't make sense
> > > since you can join with pg_stat_statements on the pid column if
> > > needed. What would be the reason to duplicate that info?
> > >
> >
> > But pg_stat_staments won't be present by default. Also, the same
> > argument could be applied for the command to be present in
> > stat_progress views. It occurred to me only when I was trying to
> > compare what we display in all the progress views. I think there is
> > some merit in being consistent.
>
> Sorry, I mean pg_stat_activity (not pg_stat_statements). That is
> included by default (at least in my installations).
>

Okay, that makes sense but I still wonder why we display it in other
stat_progress views?

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Simple progress reporting for COPY command

2021-01-07 Thread Josef Šimánek
pá 8. 1. 2021 v 5:03 odesílatel Amit Kapila  napsal:
>
> On Fri, Jan 8, 2021 at 8:42 AM Josef Šimánek  wrote:
> >
> > pá 8. 1. 2021 v 3:55 odesílatel Amit Kapila  
> > napsal:
> > >
> > >
> > > Can't we display the entire COPY command? I checked that
> > > pg_stat_statements display the query so there shouldn't be a problem
> > > to display the entire command.
> >
> > In previous discussions there was mentioned it doesn't make sense
> > since you can join with pg_stat_statements on the pid column if
> > needed. What would be the reason to duplicate that info?
> >
>
> But pg_stat_staments won't be present by default. Also, the same
> argument could be applied for the command to be present in
> stat_progress views. It occurred to me only when I was trying to
> compare what we display in all the progress views. I think there is
> some merit in being consistent.

Sorry, I mean pg_stat_activity (not pg_stat_statements). That is
included by default (at least in my installations).

> --
> With Regards,
> Amit Kapila.




Re: [PATCH] Simple progress reporting for COPY command

2021-01-07 Thread Amit Kapila
On Fri, Jan 8, 2021 at 8:42 AM Josef Šimánek  wrote:
>
> pá 8. 1. 2021 v 3:55 odesílatel Amit Kapila  napsal:
> >
> >
> > Can't we display the entire COPY command? I checked that
> > pg_stat_statements display the query so there shouldn't be a problem
> > to display the entire command.
>
> In previous discussions there was mentioned it doesn't make sense
> since you can join with pg_stat_statements on the pid column if
> needed. What would be the reason to duplicate that info?
>

But pg_stat_staments won't be present by default. Also, the same
argument could be applied for the command to be present in
stat_progress views. It occurred to me only when I was trying to
compare what we display in all the progress views. I think there is
some merit in being consistent.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Simple progress reporting for COPY command

2021-01-07 Thread Josef Šimánek
pá 8. 1. 2021 v 3:55 odesílatel Amit Kapila  napsal:
>
> On Thu, Jan 7, 2021 at 7:02 PM Josef Šimánek  wrote:
> >
> > čt 7. 1. 2021 v 14:08 odesílatel Amit Kapila  
> > napsal:
> > >
> > > On Thu, Jan 7, 2021 at 3:15 AM Tomas Vondra
> > >  wrote:
> > > >
> > > > On 1/5/21 11:02 AM, Josef Šimánek wrote:
> > > > > I'm attaching the whole patch since commitfest failed to ingest the
> > > > > last incremental on CI.
> > > > >
> > > >
> > > > Yeah, the whole patch needs to be attached for the commitfest tester to
> > > > work correctly - it can't apply pieces from multiple messages, etc.
> > > >
> > > > Anyway, I pushed this last version of patch, after a couple more tweaks,
> > > > mainly to the docs - one place used pg_stat_copy_progress, the section
> > > > was not indexed properly, and so on.
> > > >
> > >
> > > How about including a column for command similar to
> > > pg_stat_progress_create_index and pg_stat_progress_cluster? It seems
> > > that command will be useful in the context of COPY as there are many
> > > variants of COPY.
> > >
> >
> > From pg_stat_progress_create_index docs:
> >
> > The command that is running: CREATE INDEX, CREATE INDEX CONCURRENTLY,
> > REINDEX, or REINDEX CONCURRENTLY.
> >
> > Which values should be present in copy progress? Just COPY FROM or COPY TO?
> >
>
> Can't we display the entire COPY command? I checked that
> pg_stat_statements display the query so there shouldn't be a problem
> to display the entire command.

In previous discussions there was mentioned it doesn't make sense
since you can join with pg_stat_statements on the pid column if
needed. What would be the reason to duplicate that info?

> --
> With Regards,
> Amit Kapila.




Re: [PATCH] Simple progress reporting for COPY command

2021-01-07 Thread Amit Kapila
On Thu, Jan 7, 2021 at 7:02 PM Josef Šimánek  wrote:
>
> čt 7. 1. 2021 v 14:08 odesílatel Amit Kapila  napsal:
> >
> > On Thu, Jan 7, 2021 at 3:15 AM Tomas Vondra
> >  wrote:
> > >
> > > On 1/5/21 11:02 AM, Josef Šimánek wrote:
> > > > I'm attaching the whole patch since commitfest failed to ingest the
> > > > last incremental on CI.
> > > >
> > >
> > > Yeah, the whole patch needs to be attached for the commitfest tester to
> > > work correctly - it can't apply pieces from multiple messages, etc.
> > >
> > > Anyway, I pushed this last version of patch, after a couple more tweaks,
> > > mainly to the docs - one place used pg_stat_copy_progress, the section
> > > was not indexed properly, and so on.
> > >
> >
> > How about including a column for command similar to
> > pg_stat_progress_create_index and pg_stat_progress_cluster? It seems
> > that command will be useful in the context of COPY as there are many
> > variants of COPY.
> >
>
> From pg_stat_progress_create_index docs:
>
> The command that is running: CREATE INDEX, CREATE INDEX CONCURRENTLY,
> REINDEX, or REINDEX CONCURRENTLY.
>
> Which values should be present in copy progress? Just COPY FROM or COPY TO?
>

Can't we display the entire COPY command? I checked that
pg_stat_statements display the query so there shouldn't be a problem
to display the entire command.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Simple progress reporting for COPY command

2021-01-07 Thread Josef Šimánek
čt 7. 1. 2021 v 22:37 odesílatel Tomas Vondra
 napsal:
>
>
>
> On 1/7/21 7:56 PM, Josef Šimánek wrote:
> > čt 7. 1. 2021 v 19:51 odesílatel Matthias van de Meent
> >  napsal:
> >>
> >> On Wed, 6 Jan 2021 at 22:45, Tomas Vondra  
> >> wrote:
> >>>
> >>> On 1/5/21 11:02 AM, Josef Šimánek wrote:
>  I'm attaching the whole patch since commitfest failed to ingest the
>  last incremental on CI.
> 
> >>>
> >>> Yeah, the whole patch needs to be attached for the commitfest tester to
> >>> work correctly - it can't apply pieces from multiple messages, etc.
> >>>
> >>> Anyway, I pushed this last version of patch, after a couple more tweaks,
> >>> mainly to the docs - one place used pg_stat_copy_progress, the section
> >>> was not indexed properly, and so on.
> >>
> >> Thank you all, I'd love to use this in the future to keep track of
> >> (e.g.) my backup/restore progress.
> >>
> >> For my previously mentioned extension to keep track of filtered tuples
> >> that are excluded by the WHERE-clause, PFA patch 0001 that keeps track
> >> of that, in line with the current column name style of lines.
> >
> > If I understand it well, this column could be used on COPY TO to track
> > skipped lines because of BEFORE TRIGGER, right? I can include this in
> > my following patch keeping lines_processed incremented even for
> > skipped lines as well.
> >
>
> That's my understanding too.
>
> >> If so desired, I'll split this off into a different thread & CF entry.
> >>
> >>> I see Matthias proposed to change "lines" to "tuples" - I only saw the
> >>> message after pushing, but I probably wouldn't make that change anyway.
> >>> The CSV docs seem to talk about lines, newlines etc. so it seems fine.
> >>> If not, we can change that.
> >>
> >> The CSV docs, sure. But copy doesn't only process CSVs; it also has
> >> text (which does have a # lines = # tuples / rows guarantee) and
> >> binary (in which the 'line' vocabulary doesn't make sense, and in
> >> which the 'tuples' vocabulary is used). Additionally, most mentions of
> >> postgres' logical rows/tuples in the COPY documentation use the 'rows'
> >> terminology ('tuples' for FORMAT BINARY), and use 'line' only for
> >> external format's textual representation's strings delimited by
> >> newlines (which I believe is not exactly what we're counting).
> >>
> >> One common user of COPY is the pg_dump tool, and that uses binary
> >> format by default (iirc).
> >>
> >> Additionally, all comments surrounding the *LINES_PROCESSED updates
> >> only mention 'tuples', so I'd like to strongly suggest (a variant of)
> >> attached patch 0002 to keep the vocabulary consistent by using
> >> 'tuples' instead of 'lines'.
> >>
>
> I'm not particularly attached to the "lines" naming, it just seemed OK
> to me. So if there's consensus to rename this somehow, I'm OK with it.

The problem I do see here is it depends on the "way" of COPY. If
you're copying from CSV file to table, those are actually lines (since
1 line = 1 tuple). But copying from DB to file is copying tuples (but
1 tuple = 1 file line). Line works better here for me personally.

Once I'll fix the problem with triggers (and also another cases if
found), I think we can consider it lines. It will represent amount of
lines processed from file on COPY FROM and amount of lines written to
file in COPY TO form (at least in CSV format). I'm not sure how BINARY
format works, I'll check.
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company




Re: [PATCH] Simple progress reporting for COPY command

2021-01-07 Thread Tomas Vondra




On 1/7/21 7:56 PM, Josef Šimánek wrote:

čt 7. 1. 2021 v 19:51 odesílatel Matthias van de Meent
 napsal:


On Wed, 6 Jan 2021 at 22:45, Tomas Vondra  wrote:


On 1/5/21 11:02 AM, Josef Šimánek wrote:

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.



Yeah, the whole patch needs to be attached for the commitfest tester to
work correctly - it can't apply pieces from multiple messages, etc.

Anyway, I pushed this last version of patch, after a couple more tweaks,
mainly to the docs - one place used pg_stat_copy_progress, the section
was not indexed properly, and so on.


Thank you all, I'd love to use this in the future to keep track of
(e.g.) my backup/restore progress.

For my previously mentioned extension to keep track of filtered tuples
that are excluded by the WHERE-clause, PFA patch 0001 that keeps track
of that, in line with the current column name style of lines.


If I understand it well, this column could be used on COPY TO to track
skipped lines because of BEFORE TRIGGER, right? I can include this in
my following patch keeping lines_processed incremented even for
skipped lines as well.



That's my understanding too.


If so desired, I'll split this off into a different thread & CF entry.


I see Matthias proposed to change "lines" to "tuples" - I only saw the
message after pushing, but I probably wouldn't make that change anyway.
The CSV docs seem to talk about lines, newlines etc. so it seems fine.
If not, we can change that.


The CSV docs, sure. But copy doesn't only process CSVs; it also has
text (which does have a # lines = # tuples / rows guarantee) and
binary (in which the 'line' vocabulary doesn't make sense, and in
which the 'tuples' vocabulary is used). Additionally, most mentions of
postgres' logical rows/tuples in the COPY documentation use the 'rows'
terminology ('tuples' for FORMAT BINARY), and use 'line' only for
external format's textual representation's strings delimited by
newlines (which I believe is not exactly what we're counting).

One common user of COPY is the pg_dump tool, and that uses binary
format by default (iirc).

Additionally, all comments surrounding the *LINES_PROCESSED updates
only mention 'tuples', so I'd like to strongly suggest (a variant of)
attached patch 0002 to keep the vocabulary consistent by using
'tuples' instead of 'lines'.



I'm not particularly attached to the "lines" naming, it just seemed OK 
to me. So if there's consensus to rename this somehow, I'm OK with it.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Simple progress reporting for COPY command

2021-01-07 Thread Josef Šimánek
čt 7. 1. 2021 v 19:51 odesílatel Matthias van de Meent
 napsal:
>
> On Wed, 6 Jan 2021 at 22:45, Tomas Vondra  
> wrote:
> >
> > On 1/5/21 11:02 AM, Josef Šimánek wrote:
> > > I'm attaching the whole patch since commitfest failed to ingest the
> > > last incremental on CI.
> > >
> >
> > Yeah, the whole patch needs to be attached for the commitfest tester to
> > work correctly - it can't apply pieces from multiple messages, etc.
> >
> > Anyway, I pushed this last version of patch, after a couple more tweaks,
> > mainly to the docs - one place used pg_stat_copy_progress, the section
> > was not indexed properly, and so on.
>
> Thank you all, I'd love to use this in the future to keep track of
> (e.g.) my backup/restore progress.
>
> For my previously mentioned extension to keep track of filtered tuples
> that are excluded by the WHERE-clause, PFA patch 0001 that keeps track
> of that, in line with the current column name style of lines.

If I understand it well, this column could be used on COPY TO to track
skipped lines because of BEFORE TRIGGER, right? I can include this in
my following patch keeping lines_processed incremented even for
skipped lines as well.

> If so desired, I'll split this off into a different thread & CF entry.
>
> > I see Matthias proposed to change "lines" to "tuples" - I only saw the
> > message after pushing, but I probably wouldn't make that change anyway.
> > The CSV docs seem to talk about lines, newlines etc. so it seems fine.
> > If not, we can change that.
>
> The CSV docs, sure. But copy doesn't only process CSVs; it also has
> text (which does have a # lines = # tuples / rows guarantee) and
> binary (in which the 'line' vocabulary doesn't make sense, and in
> which the 'tuples' vocabulary is used). Additionally, most mentions of
> postgres' logical rows/tuples in the COPY documentation use the 'rows'
> terminology ('tuples' for FORMAT BINARY), and use 'line' only for
> external format's textual representation's strings delimited by
> newlines (which I believe is not exactly what we're counting).
>
> One common user of COPY is the pg_dump tool, and that uses binary
> format by default (iirc).
>
> Additionally, all comments surrounding the *LINES_PROCESSED updates
> only mention 'tuples', so I'd like to strongly suggest (a variant of)
> attached patch 0002 to keep the vocabulary consistent by using
> 'tuples' instead of 'lines'.
>
>
> With regards,
>
> Matthias van de Meent




Re: [PATCH] Simple progress reporting for COPY command

2021-01-07 Thread Matthias van de Meent
On Wed, 6 Jan 2021 at 22:45, Tomas Vondra  wrote:
>
> On 1/5/21 11:02 AM, Josef Šimánek wrote:
> > I'm attaching the whole patch since commitfest failed to ingest the
> > last incremental on CI.
> >
>
> Yeah, the whole patch needs to be attached for the commitfest tester to
> work correctly - it can't apply pieces from multiple messages, etc.
>
> Anyway, I pushed this last version of patch, after a couple more tweaks,
> mainly to the docs - one place used pg_stat_copy_progress, the section
> was not indexed properly, and so on.

Thank you all, I'd love to use this in the future to keep track of
(e.g.) my backup/restore progress.

For my previously mentioned extension to keep track of filtered tuples
that are excluded by the WHERE-clause, PFA patch 0001 that keeps track
of that, in line with the current column name style of lines.

If so desired, I'll split this off into a different thread & CF entry.

> I see Matthias proposed to change "lines" to "tuples" - I only saw the
> message after pushing, but I probably wouldn't make that change anyway.
> The CSV docs seem to talk about lines, newlines etc. so it seems fine.
> If not, we can change that.

The CSV docs, sure. But copy doesn't only process CSVs; it also has
text (which does have a # lines = # tuples / rows guarantee) and
binary (in which the 'line' vocabulary doesn't make sense, and in
which the 'tuples' vocabulary is used). Additionally, most mentions of
postgres' logical rows/tuples in the COPY documentation use the 'rows'
terminology ('tuples' for FORMAT BINARY), and use 'line' only for
external format's textual representation's strings delimited by
newlines (which I believe is not exactly what we're counting).

One common user of COPY is the pg_dump tool, and that uses binary
format by default (iirc).

Additionally, all comments surrounding the *LINES_PROCESSED updates
only mention 'tuples', so I'd like to strongly suggest (a variant of)
attached patch 0002 to keep the vocabulary consistent by using
'tuples' instead of 'lines'.


With regards,

Matthias van de Meent
From d988c86d60d2b49b52e547b42c644493a377d44a Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Thu, 7 Jan 2021 16:39:57 +0100
Subject: [PATCH v1 1/2] Add progress reporting for filtered rows.

COPY ... FROM ... WHERE (condition) may drop rows, which we now register as
explicitly being dropped.
---
 doc/src/sgml/monitoring.sgml | 10 ++
 src/backend/catalog/system_views.sql |  3 ++-
 src/backend/commands/copyfrom.c  |  5 +
 src/include/commands/progress.h  |  1 +
 4 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 43fe8ae383..bf18ea23bd 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6498,6 +6498,16 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
Number of lines already processed by COPY command.
   
  
+
+ 
+  
+   lines_filtered bigint
+  
+  
+   Number of lines not processed due to being filtered by the
+   WHERE clause of the COPY command.
+  
+ 
 

   
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5d89e77dbe..83d62c5286 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1124,7 +1124,8 @@ CREATE VIEW pg_stat_progress_copy AS
 S.relid AS relid,
 S.param1 AS bytes_processed,
 S.param2 AS bytes_total,
-S.param3 AS lines_processed
+S.param3 AS lines_processed,
+S.param4 AS lines_filtered
 FROM pg_stat_get_progress_info('COPY') AS S
 LEFT JOIN pg_database D ON S.datid = D.oid;
 
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 08b6f782c7..212f4c67d6 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -540,6 +540,7 @@ CopyFrom(CopyFromState cstate)
 	CopyInsertMethod insertMethod;
 	CopyMultiInsertInfo multiInsertInfo = {0};	/* pacify compiler */
 	uint64		processed = 0;
+	uint64		filtered = 0;
 	bool		has_before_insert_row_trig;
 	bool		has_instead_insert_row_trig;
 	bool		leafpart_use_multi_insert = false;
@@ -868,7 +869,11 @@ CopyFrom(CopyFromState cstate)
 			econtext->ecxt_scantuple = myslot;
 			/* Skip items that don't match COPY's WHERE clause */
 			if (!ExecQual(cstate->qualexpr, econtext))
+			{
+/* Report that this tuple was filtered out by the WHERE clause */
+pgstat_progress_update_param(PROGRESS_COPY_LINES_FILTERED, ++filtered);
 continue;
+			}
 		}
 
 		/* Determine the partition to insert the tuple into */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 95ec5d02e9..eb0f4e70d6 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -137,5 +137,6 @@
 #define PROGRESS_COPY_BYTES_PROCESSED 0
 #define PROGRESS_COPY_BYTES_TOTAL 1
 #define 

Re: [PATCH] Simple progress reporting for COPY command

2021-01-07 Thread Tomas Vondra

On 1/7/21 2:17 AM, Justin Pryzby wrote:

On Wed, Jan 06, 2021 at 10:44:49PM +0100, Tomas Vondra wrote:

On 1/5/21 11:02 AM, Josef Šimánek wrote:

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.



Yeah, the whole patch needs to be attached for the commitfest tester to work
correctly - it can't apply pieces from multiple messages, etc.

Anyway, I pushed this last version of patch, after a couple more tweaks,
mainly to the docs - one place used pg_stat_copy_progress, the section was
not indexed properly, and so on.


Some more doc fixes.


Thanks, pushed.

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Simple progress reporting for COPY command

2021-01-07 Thread Josef Šimánek
čt 7. 1. 2021 v 16:54 odesílatel Tomas Vondra
 napsal:
>
>
>
> On 1/7/21 12:06 PM, Josef Šimánek wrote:
> > st 6. 1. 2021 v 22:44 odesílatel Tomas Vondra
> >  napsal:
> >>
> >> On 1/5/21 11:02 AM, Josef Šimánek wrote:
> >>> I'm attaching the whole patch since commitfest failed to ingest the
> >>> last incremental on CI.
> >>>
> >>
> >> Yeah, the whole patch needs to be attached for the commitfest tester to
> >> work correctly - it can't apply pieces from multiple messages, etc.
> >>
> >> Anyway, I pushed this last version of patch, after a couple more tweaks,
> >> mainly to the docs - one place used pg_stat_copy_progress, the section
> >> was not indexed properly, and so on.
> >>
> >> I see Matthias proposed to change "lines" to "tuples" - I only saw the
> >> message after pushing, but I probably wouldn't make that change anyway.
> >> The CSV docs seem to talk about lines, newlines etc. so it seems fine.
> >> If not, we can change that.
> >>
> >> One more question, though - I now realize the lines_processed ignores
> >> rows skipped because of BEFORE INSERT triggers. I wonder if that's the
> >> right thing to do? Imagine you know the number of lines in a file. You
> >> can't really use (lines_processed / total_lines) to measure progress,
> >> because that may ignore many "skipped" rows. So maybe this should be
> >> changed to count all rows. OTOH we still have bytes_processed.
> >
> > I think that should be fixed. It is called "lines_processed" not
> > "lines_inserted". I'll take a look.
> >
>
> So we may either rename the column to "lines_inserted", or tweak the
> code to count all processed lines. Or track both and have two columns.

First I'll ensure lines_processed represents the actual amount of
processed lines. If reading from file and some lines are skipped due
to before insert trigger, I consider that one processed as well, even
if it is not inserted. If welcomed, I can add lines_inserted later as
well. But I'm not sure about the use case.

Also thanks to mentioning triggers, I think those could be used to
test the COPY progress (at least some variants). I'll check if I would
be able to add some test cases as well.

> regarss
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company




Re: [PATCH] Simple progress reporting for COPY command

2021-01-07 Thread Tomas Vondra




On 1/7/21 12:06 PM, Josef Šimánek wrote:

st 6. 1. 2021 v 22:44 odesílatel Tomas Vondra
 napsal:


On 1/5/21 11:02 AM, Josef Šimánek wrote:

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.



Yeah, the whole patch needs to be attached for the commitfest tester to
work correctly - it can't apply pieces from multiple messages, etc.

Anyway, I pushed this last version of patch, after a couple more tweaks,
mainly to the docs - one place used pg_stat_copy_progress, the section
was not indexed properly, and so on.

I see Matthias proposed to change "lines" to "tuples" - I only saw the
message after pushing, but I probably wouldn't make that change anyway.
The CSV docs seem to talk about lines, newlines etc. so it seems fine.
If not, we can change that.

One more question, though - I now realize the lines_processed ignores
rows skipped because of BEFORE INSERT triggers. I wonder if that's the
right thing to do? Imagine you know the number of lines in a file. You
can't really use (lines_processed / total_lines) to measure progress,
because that may ignore many "skipped" rows. So maybe this should be
changed to count all rows. OTOH we still have bytes_processed.


I think that should be fixed. It is called "lines_processed" not
"lines_inserted". I'll take a look.



So we may either rename the column to "lines_inserted", or tweak the 
code to count all processed lines. Or track both and have two columns.


regarss

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Simple progress reporting for COPY command

2021-01-07 Thread Josef Šimánek
čt 7. 1. 2021 v 14:08 odesílatel Amit Kapila  napsal:
>
> On Thu, Jan 7, 2021 at 3:15 AM Tomas Vondra
>  wrote:
> >
> > On 1/5/21 11:02 AM, Josef Šimánek wrote:
> > > I'm attaching the whole patch since commitfest failed to ingest the
> > > last incremental on CI.
> > >
> >
> > Yeah, the whole patch needs to be attached for the commitfest tester to
> > work correctly - it can't apply pieces from multiple messages, etc.
> >
> > Anyway, I pushed this last version of patch, after a couple more tweaks,
> > mainly to the docs - one place used pg_stat_copy_progress, the section
> > was not indexed properly, and so on.
> >
>
> How about including a column for command similar to
> pg_stat_progress_create_index and pg_stat_progress_cluster? It seems
> that command will be useful in the context of COPY as there are many
> variants of COPY.
>

>From pg_stat_progress_create_index docs:

The command that is running: CREATE INDEX, CREATE INDEX CONCURRENTLY,
REINDEX, or REINDEX CONCURRENTLY.

Which values should be present in copy progress? Just COPY FROM or COPY TO?

> With Regards,
> Amit Kapila.




Re: [PATCH] Simple progress reporting for COPY command

2021-01-07 Thread Amit Kapila
On Thu, Jan 7, 2021 at 3:15 AM Tomas Vondra
 wrote:
>
> On 1/5/21 11:02 AM, Josef Šimánek wrote:
> > I'm attaching the whole patch since commitfest failed to ingest the
> > last incremental on CI.
> >
>
> Yeah, the whole patch needs to be attached for the commitfest tester to
> work correctly - it can't apply pieces from multiple messages, etc.
>
> Anyway, I pushed this last version of patch, after a couple more tweaks,
> mainly to the docs - one place used pg_stat_copy_progress, the section
> was not indexed properly, and so on.
>

How about including a column for command similar to
pg_stat_progress_create_index and pg_stat_progress_cluster? It seems
that command will be useful in the context of COPY as there are many
variants of COPY.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Simple progress reporting for COPY command

2021-01-07 Thread Josef Šimánek
st 6. 1. 2021 v 22:44 odesílatel Tomas Vondra
 napsal:
>
> On 1/5/21 11:02 AM, Josef Šimánek wrote:
> > I'm attaching the whole patch since commitfest failed to ingest the
> > last incremental on CI.
> >
>
> Yeah, the whole patch needs to be attached for the commitfest tester to
> work correctly - it can't apply pieces from multiple messages, etc.
>
> Anyway, I pushed this last version of patch, after a couple more tweaks,
> mainly to the docs - one place used pg_stat_copy_progress, the section
> was not indexed properly, and so on.
>
> I see Matthias proposed to change "lines" to "tuples" - I only saw the
> message after pushing, but I probably wouldn't make that change anyway.
> The CSV docs seem to talk about lines, newlines etc. so it seems fine.
> If not, we can change that.
>
> One more question, though - I now realize the lines_processed ignores
> rows skipped because of BEFORE INSERT triggers. I wonder if that's the
> right thing to do? Imagine you know the number of lines in a file. You
> can't really use (lines_processed / total_lines) to measure progress,
> because that may ignore many "skipped" rows. So maybe this should be
> changed to count all rows. OTOH we still have bytes_processed.

I think that should be fixed. It is called "lines_processed" not
"lines_inserted". I'll take a look.

>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company




Re: [PATCH] Simple progress reporting for COPY command

2021-01-06 Thread Justin Pryzby
On Wed, Jan 06, 2021 at 10:44:49PM +0100, Tomas Vondra wrote:
> On 1/5/21 11:02 AM, Josef Šimánek wrote:
> > I'm attaching the whole patch since commitfest failed to ingest the
> > last incremental on CI.
> > 
> 
> Yeah, the whole patch needs to be attached for the commitfest tester to work
> correctly - it can't apply pieces from multiple messages, etc.
> 
> Anyway, I pushed this last version of patch, after a couple more tweaks,
> mainly to the docs - one place used pg_stat_copy_progress, the section was
> not indexed properly, and so on.

Some more doc fixes.

>From 0616f448057905ab9b3218ebddfdd3af52e62bac Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 6 Jan 2021 19:08:11 -0600
Subject: [PATCH] doc review: COPY progress:
 8a4f618e7ae3cb11b0b37d0f06f05c8ff905833f

---
 doc/src/sgml/monitoring.sgml | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 43fe8ae383..3cdb1aff3c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6414,9 +6414,9 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
   
Whenever COPY is running, the
pg_stat_progress_copy view will contain one row
-   for each backend that is currently running COPY command.
-   The table bellow describes the information that will be reported and provide
-   information how to interpret it.
+   for each backend that is currently running a COPY 
command.
+   The table below describes the information that will be reported and provides
+   information about how to interpret it.
   
 
   
@@ -6445,7 +6445,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
 
  
   
-   datid text
+   datid oid
   
   
OID of the database to which this backend is connected.
@@ -6467,7 +6467,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
   
   
OID of the table on which the COPY command is 
executed.
-   It is set to 0 if SELECT query is provided.
+   It is set to 0 if copying from a SELECT query.
   
  
 
-- 
2.17.0

>From 0616f448057905ab9b3218ebddfdd3af52e62bac Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 6 Jan 2021 19:08:11 -0600
Subject: [PATCH] doc review: COPY progress:
 8a4f618e7ae3cb11b0b37d0f06f05c8ff905833f

---
 doc/src/sgml/monitoring.sgml | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 43fe8ae383..3cdb1aff3c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6414,9 +6414,9 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
   
Whenever COPY is running, the
pg_stat_progress_copy view will contain one row
-   for each backend that is currently running COPY command.
-   The table bellow describes the information that will be reported and provide
-   information how to interpret it.
+   for each backend that is currently running a COPY command.
+   The table below describes the information that will be reported and provides
+   information about how to interpret it.
   
 
   
@@ -6445,7 +6445,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
 
  
   
-   datid text
+   datid oid
   
   
OID of the database to which this backend is connected.
@@ -6467,7 +6467,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
   
   
OID of the table on which the COPY command is executed.
-   It is set to 0 if SELECT query is provided.
+   It is set to 0 if copying from a SELECT query.
   
  
 
-- 
2.17.0



Re: [PATCH] Simple progress reporting for COPY command

2021-01-06 Thread Tomas Vondra

On 1/5/21 11:02 AM, Josef Šimánek wrote:

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.



Yeah, the whole patch needs to be attached for the commitfest tester to 
work correctly - it can't apply pieces from multiple messages, etc.


Anyway, I pushed this last version of patch, after a couple more tweaks, 
mainly to the docs - one place used pg_stat_copy_progress, the section 
was not indexed properly, and so on.


I see Matthias proposed to change "lines" to "tuples" - I only saw the 
message after pushing, but I probably wouldn't make that change anyway. 
The CSV docs seem to talk about lines, newlines etc. so it seems fine. 
If not, we can change that.


One more question, though - I now realize the lines_processed ignores 
rows skipped because of BEFORE INSERT triggers. I wonder if that's the 
right thing to do? Imagine you know the number of lines in a file. You 
can't really use (lines_processed / total_lines) to measure progress, 
because that may ignore many "skipped" rows. So maybe this should be 
changed to count all rows. OTOH we still have bytes_processed.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Simple progress reporting for COPY command

2021-01-05 Thread Matthias van de Meent
On Fri, 1 Jan 2021 at 02:25, Josef Šimánek  wrote:
>
> Hello,
>
> finally I had some time to revisit patch and all comments from
>
https://www.postgresql.org/message-id/CAFp7QwqMGEi4OyyaLEK9DR0%2BE%2BoK3UtA4bEjDVCa4bNkwUY2PQ%40mail.gmail.com
> and I have prepared simple version of COPY command progress reporting.

+1

> To keep the patch small as possible, I have introduced only a minimum
> set of columns. It could be extended later if needed.

Seems reasonable. One potential extention I could think of is progress
reporting of tuples processed before/after applying the COPY's WHERE
clause, when and where applicable.

> Columns are inspired by CREATE INDEX progress report system view.
>
> pid - integer - PID of backend
> datid - oid - OID of related database
> datname - name - name of related database (this seems redundant, since
> oid should be enough, but it is the same in CREATE INDEX)
> relid - oid - oid of table related to COPY command, when not known
> (for example when copying to file, it is 0)
> bytes_processed - bigint - amount of bytes processed
> bytes_total - bigint - file size in bytes if COPY FROM file (0 if not
> COPY FROM file)
> lines_processed - bigint - amount of tuples processed

Could you update the name of this column to be consistent with the
'tuples'-terminology used in the other progress reporting views, i.e.
lines_processed -> tuples_processed? That adds consistency, and also
disambiguates the meaning in case of e.g. COPY FROM (format CSV), as
multiline CSV fields do exist, and we're not necessarily counting lines
from or to a file.



With regards,


Matthias van de Meent


Re: [PATCH] Simple progress reporting for COPY command

2021-01-05 Thread Josef Šimánek
I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.


út 5. 1. 2021 v 2:32 odesílatel Josef Šimánek  napsal:
>
> út 5. 1. 2021 v 0:46 odesílatel Tomas Vondra
>  napsal:
> >
> > Hi,
> >
> > I did take a quick look today, and I have a couple minor comments:
>
> Hi! Thanks for your time.
>
> > 1) The catalog sgml docs seem to mention bytes_processed twice (one of
> > that should be bytes_total), and line_processed (should be "lines_").
>
> Fixed in attached patch.
>
> >
> > 2) I'm not quite sure about not including any info about the command.
> > For example pg_stat_progress_create_index includes info about the
> > command, although it's not very detailed. Not sure how useful would it
> > be to show just COPY FROM / COPY TO, without more details.
> >
> > It's probably possible to extract this from pg_stat_activity, but that
> > may be rather cumbersome (parsing arbitrary SQL and all that). Also,
> > what if the COPY is called from a function etc.?
>
> Any idea where to discuss this? My usecase is really simple. I would
> like to be able to see progress of COPY command by pid. There is a lot
> of COPY info inside CopyToStateData and CopyFromStateData structs. The
> only limitation I see is support of only int64 values for progress
> reporting columns. I'm not sure if it is easily possible to expose for
> example filename this way.
>
> >
> > 3) I wonder if we should have something like lines_estimated. For COPY
> > TO it's pretty simple to calculate it as
> >
> >  bytes_total / (bytes_processed / lines_processed)
> >
> > but what about
> >
> >  COPY (query) TO file
> >
> > In that case we don't know the total amount of bytes / rows, so we can't
> > calculate any estimate. So maybe we could peek into the query plan? But
> > I agree this is something we can add later.
>
> If I remember well one of the original ideas was to be able to pass
> custom bytes_total (or lines_total) by COPY options to be stored in
> copy progress. I can add this in some following patch if still
> welcomed.
>
> For estimates I would prefer to add an additional column to not mix
> those two together (or at least boolean estimated = true/false and
> reuse bytes_total column). If query estimates are welcomed, I can take
> a look at how to reach the query plan and expose those numbers when
> the query is used to estimated_lines and potentially estimated_bytes
> columns. It would be probably a little tricky to calculate
> estimated_bytes for some column types.
>
> Also currently only COPY FROM file supports bytes_total (it is 0 for
> all other scenarios). I think it should be possible for PostgreSQL to
> know the actual amount of lines query returns for some kind of
> queries, but I have no idea where to look at this. If that's possible
> to get, it would be one of the next steps to introduce additional
> column lines_total.
>
> >
> > 4) This comment is a bit confusing, as it mixes "total" and "processed".
> > I'd just say "number of bytes processed so far" instead.
> >
> Fixed in attached patch.
> >
> > Other than that, it seems fine. I'm sure we could add more features, but
> > it seems like a good start - I plan to get this committed once I get a
> > patch fixing the docs issues.
>
> Patch is attached, it should be applied to the top of the previous
> patch. Overall patch (having both patches merged together) could be
> found at 
> https://patch-diff.githubusercontent.com/raw/simi/postgres/pull/6.patch.
>
> > regards
> >
> > --
> > Tomas Vondra
> > EnterpriseDB: http://www.enterprisedb.com
> > The Enterprise PostgreSQL Company
From 847b227dac1ce2e9554a32ff95b8d618f8725843 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Josef=20=C5=A0im=C3=A1nek?= 
Date: Fri, 1 Jan 2021 01:14:47 +0100
Subject: [PATCH 1/2] Add pg_stat_progress_copy view with COPY progress report.

---
 doc/src/sgml/monitoring.sgml | 101 +++
 src/backend/catalog/system_views.sql |  11 +++
 src/backend/commands/copyfrom.c  |  16 +++-
 src/backend/commands/copyfromparse.c |   4 +
 src/backend/commands/copyto.c|  21 -
 src/backend/utils/adt/pgstatfuncs.c  |   2 +
 src/include/commands/copyfrom_internal.h |   1 +
 src/include/commands/progress.h  |   5 ++
 src/include/pgstat.h |   3 +-
 src/test/regress/expected/rules.out  |   9 ++
 10 files changed, 168 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3d6c90130677..51d261defd94 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -399,6 +399,12 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
   
  
 
+ 
+  pg_stat_progress_copypg_stat_progress_copy
+  One row for each backend running COPY, showing current progress.
+   See .
+  
+ 
 

   
@@ -5247,6 +5253,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
which support 

Re: [PATCH] Simple progress reporting for COPY command

2021-01-04 Thread Josef Šimánek
út 5. 1. 2021 v 0:46 odesílatel Tomas Vondra
 napsal:
>
> Hi,
>
> I did take a quick look today, and I have a couple minor comments:

Hi! Thanks for your time.

> 1) The catalog sgml docs seem to mention bytes_processed twice (one of
> that should be bytes_total), and line_processed (should be "lines_").

Fixed in attached patch.

>
> 2) I'm not quite sure about not including any info about the command.
> For example pg_stat_progress_create_index includes info about the
> command, although it's not very detailed. Not sure how useful would it
> be to show just COPY FROM / COPY TO, without more details.
>
> It's probably possible to extract this from pg_stat_activity, but that
> may be rather cumbersome (parsing arbitrary SQL and all that). Also,
> what if the COPY is called from a function etc.?

Any idea where to discuss this? My usecase is really simple. I would
like to be able to see progress of COPY command by pid. There is a lot
of COPY info inside CopyToStateData and CopyFromStateData structs. The
only limitation I see is support of only int64 values for progress
reporting columns. I'm not sure if it is easily possible to expose for
example filename this way.

>
> 3) I wonder if we should have something like lines_estimated. For COPY
> TO it's pretty simple to calculate it as
>
>  bytes_total / (bytes_processed / lines_processed)
>
> but what about
>
>  COPY (query) TO file
>
> In that case we don't know the total amount of bytes / rows, so we can't
> calculate any estimate. So maybe we could peek into the query plan? But
> I agree this is something we can add later.

If I remember well one of the original ideas was to be able to pass
custom bytes_total (or lines_total) by COPY options to be stored in
copy progress. I can add this in some following patch if still
welcomed.

For estimates I would prefer to add an additional column to not mix
those two together (or at least boolean estimated = true/false and
reuse bytes_total column). If query estimates are welcomed, I can take
a look at how to reach the query plan and expose those numbers when
the query is used to estimated_lines and potentially estimated_bytes
columns. It would be probably a little tricky to calculate
estimated_bytes for some column types.

Also currently only COPY FROM file supports bytes_total (it is 0 for
all other scenarios). I think it should be possible for PostgreSQL to
know the actual amount of lines query returns for some kind of
queries, but I have no idea where to look at this. If that's possible
to get, it would be one of the next steps to introduce additional
column lines_total.

>
> 4) This comment is a bit confusing, as it mixes "total" and "processed".
> I'd just say "number of bytes processed so far" instead.
>
Fixed in attached patch.
>
> Other than that, it seems fine. I'm sure we could add more features, but
> it seems like a good start - I plan to get this committed once I get a
> patch fixing the docs issues.

Patch is attached, it should be applied to the top of the previous
patch. Overall patch (having both patches merged together) could be
found at 
https://patch-diff.githubusercontent.com/raw/simi/postgres/pull/6.patch.

> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
From 6d2ee68b227c05ce4d1eb95a4c4a9c4f7dd6fbfa Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Josef=20=C5=A0im=C3=A1nek?= 
Date: Tue, 5 Jan 2021 02:07:03 +0100
Subject: [PATCH] Fix docs and comment.

---
 doc/src/sgml/monitoring.sgml | 4 ++--
 src/include/commands/copyfrom_internal.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 51d261defd94..875133303e19 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6477,7 +6477,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
 
  
   
-   bytes_processed bigint
+   bytes_total bigint
   
   
Size of source file for COPY FROM command in bytes. It is set to 0 if not available.
@@ -6486,7 +6486,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
 
  
   
-   line_processed bigint
+   lines_processed bigint
   
   
Number of lines already processed by COPY command.
diff --git a/src/include/commands/copyfrom_internal.h b/src/include/commands/copyfrom_internal.h
index ae76be295a9b..80fac1e58a12 100644
--- a/src/include/commands/copyfrom_internal.h
+++ b/src/include/commands/copyfrom_internal.h
@@ -154,7 +154,7 @@ typedef struct CopyFromStateData
 	char	   *raw_buf;
 	int			raw_buf_index;	/* next byte to process */
 	int			raw_buf_len;	/* total # of bytes stored */
-	uint64  bytes_processed;/* total # of bytes processed, used for progress reporting */
+	uint64  bytes_processed;/* number of bytes processed so far */
 	/* Shorthand for number of unconsumed bytes available in raw_buf */
 #define RAW_BUF_BYTES(cstate) 

Re: [PATCH] Simple progress reporting for COPY command

2021-01-04 Thread Tomas Vondra

Hi,

I did take a quick look today, and I have a couple minor comments:

1) The catalog sgml docs seem to mention bytes_processed twice (one of 
that should be bytes_total), and line_processed (should be "lines_").



2) I'm not quite sure about not including any info about the command. 
For example pg_stat_progress_create_index includes info about the 
command, although it's not very detailed. Not sure how useful would it 
be to show just COPY FROM / COPY TO, without more details.


It's probably possible to extract this from pg_stat_activity, but that 
may be rather cumbersome (parsing arbitrary SQL and all that). Also, 
what if the COPY is called from a function etc.?



3) I wonder if we should have something like lines_estimated. For COPY 
TO it's pretty simple to calculate it as


bytes_total / (bytes_processed / lines_processed)

but what about

COPY (query) TO file

In that case we don't know the total amount of bytes / rows, so we can't 
calculate any estimate. So maybe we could peek into the query plan? But 
I agree this is something we can add later.



4) This comment is a bit confusing, as it mixes "total" and "processed". 
I'd just say "number of bytes processed so far" instead.




Other than that, it seems fine. I'm sure we could add more features, but 
it seems like a good start - I plan to get this committed once I get a 
patch fixing the docs issues.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Simple progress reporting for COPY command

2021-01-01 Thread Josef Šimánek
pá 1. 1. 2021 v 11:16 odesílatel Bharath Rupireddy
 napsal:
>
> On Fri, Jan 1, 2021 at 6:55 AM Josef Šimánek  wrote:
> > finally I had some time to revisit patch and all comments from
> > https://www.postgresql.org/message-id/CAFp7QwqMGEi4OyyaLEK9DR0%2BE%2BoK3UtA4bEjDVCa4bNkwUY2PQ%40mail.gmail.com
> > and I have prepared simple version of COPY command progress reporting.
>
> Thanks for the patch.
>
> > To keep the patch small as possible, I have introduced only a minimum
> > set of columns. It could be extended later if needed.
> >
> > Columns are inspired by CREATE INDEX progress report system view.
> >
> > pid - integer - PID of backend
> > datid - oid - OID of related database
> > datname - name - name of related database (this seems redundant, since
> > oid should be enough, but it is the same in CREATE INDEX)
> > relid - oid - oid of table related to COPY command, when not known
> > (for example when copying to file, it is 0)
> > bytes_processed - bigint - amount of bytes processed
> > bytes_total - bigint - file size in bytes if COPY FROM file (0 if not
> > COPY FROM file)
>
> It means that, for COPY FROM STDIN, bytes_total will be 0. IIUC, so
> having this parameter as 0 is an indication of the COPY command being
> from STDIN? I'm not sure whether it's discussed or not previously, but
> I personally prefer to have it as a separate column, say a varchar or
> text column with values STDIN, FILE or PROGRAM may be. And also it
> will be better if we have the format of the data streaming in, as a
> separate column, with possible values may be CSV, TEXT, BINARY.
>
> Since this progress reporting is for both COPY FROM and COPY TO,
> wouldn't it be good to have that also as a separate column, say
> command type with values FROM and TO?
>
> Thoughts?

My first patch had more columns (direction - FROM/TO, file bool,
program bool), but it was subject of discussion what's the purpose.
You can check the query by looking at pg_stat_activity by pid to get
more info. To keep it simple I decided to go with a minimal set of
columns for now. It can be extended later. I'm ok to continue on
improving this with more feedback once merged.

> I'm sure some of the points would have already been discussed. I just
> shared my thoughts here.
>
> I have not looked into the patch in detail, but it's missing tests.
> I'm sure we can not add the tests into copy.sql or copy2.sql, can we
> think of adding test cases to TAP or under isolation? I'm not sure
> whether other progress reporting commands have test cases.

I have raised the question in an old thread as well since there are no
tests for other progress commands as well. I was told it is ok for now
to keep it untested, since there's no easy way to do that for now.

https://www.postgresql.org/message-id/20200615001828.GA52676%40paquier.xyz

> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Simple progress reporting for COPY command

2021-01-01 Thread Bharath Rupireddy
On Fri, Jan 1, 2021 at 6:55 AM Josef Šimánek  wrote:
> finally I had some time to revisit patch and all comments from
> https://www.postgresql.org/message-id/CAFp7QwqMGEi4OyyaLEK9DR0%2BE%2BoK3UtA4bEjDVCa4bNkwUY2PQ%40mail.gmail.com
> and I have prepared simple version of COPY command progress reporting.

Thanks for the patch.

> To keep the patch small as possible, I have introduced only a minimum
> set of columns. It could be extended later if needed.
>
> Columns are inspired by CREATE INDEX progress report system view.
>
> pid - integer - PID of backend
> datid - oid - OID of related database
> datname - name - name of related database (this seems redundant, since
> oid should be enough, but it is the same in CREATE INDEX)
> relid - oid - oid of table related to COPY command, when not known
> (for example when copying to file, it is 0)
> bytes_processed - bigint - amount of bytes processed
> bytes_total - bigint - file size in bytes if COPY FROM file (0 if not
> COPY FROM file)

It means that, for COPY FROM STDIN, bytes_total will be 0. IIUC, so
having this parameter as 0 is an indication of the COPY command being
from STDIN? I'm not sure whether it's discussed or not previously, but
I personally prefer to have it as a separate column, say a varchar or
text column with values STDIN, FILE or PROGRAM may be. And also it
will be better if we have the format of the data streaming in, as a
separate column, with possible values may be CSV, TEXT, BINARY.

Since this progress reporting is for both COPY FROM and COPY TO,
wouldn't it be good to have that also as a separate column, say
command type with values FROM and TO?

Thoughts?

I'm sure some of the points would have already been discussed. I just
shared my thoughts here.

I have not looked into the patch in detail, but it's missing tests.
I'm sure we can not add the tests into copy.sql or copy2.sql, can we
think of adding test cases to TAP or under isolation? I'm not sure
whether other progress reporting commands have test cases.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




[PATCH] Simple progress reporting for COPY command

2020-12-31 Thread Josef Šimánek
Hello,

finally I had some time to revisit patch and all comments from
https://www.postgresql.org/message-id/CAFp7QwqMGEi4OyyaLEK9DR0%2BE%2BoK3UtA4bEjDVCa4bNkwUY2PQ%40mail.gmail.com
and I have prepared simple version of COPY command progress reporting.

To keep the patch small as possible, I have introduced only a minimum
set of columns. It could be extended later if needed.

Columns are inspired by CREATE INDEX progress report system view.

pid - integer - PID of backend
datid - oid - OID of related database
datname - name - name of related database (this seems redundant, since
oid should be enough, but it is the same in CREATE INDEX)
relid - oid - oid of table related to COPY command, when not known
(for example when copying to file, it is 0)
bytes_processed - bigint - amount of bytes processed
bytes_total - bigint - file size in bytes if COPY FROM file (0 if not
COPY FROM file)
lines_processed - bigint - amount of tuples processed

example output of progress for common use case (import from CSV file):

first console:
yr=# COPY test FROM '/home/retro/test.csv' (FORMAT CSV);

second console:
yr=# SELECT * FROM pg_stat_progress_copy;
  pid   | datid | datname | relid | bytes_processed | bytes_total |
lines_processed
+---+-+---+-+-+-
 803148 | 16384 | yr  | 16394 |   998965248 |  177796 |
56730126
(1 row)

It is simple to get progress in percents for example by:

yr=# SELECT (bytes_processed/bytes_total::decimal)*100 FROM
pg_stat_progress_copy WHERE pid = 803148;
?column?
-
 50.04287948706048525800

^ ~50% of file processed already

I did some dead simple benchmarking as well. The difference is not
huge. Each command works with 100 millions of tuples. Times are in
seconds.

   test with progress   master (32d6287)   difference
 - --- -- 
  COPY table TO46.102 47.499   -1.397
  COPY query TO52.168 49.8222.346
  COPY table TO PROGRAM52.345 51.8820.463
  COPY query TO PROGRAM54.141 52.7631.378
  COPY table FROM  88.970 85.1613.809
  COPY table FROM PROGRAM  94.393 90.3464.047

Properly formatted table (since I'm not sure everyone here would be
able to see the table formatted well) and the benchmark source is
present at https://github.com/simi/postgres/pull/6. I have also
included an example output in there.

I'll add this to the current commitfest as well.
From 847b227dac1ce2e9554a32ff95b8d618f8725843 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Josef=20=C5=A0im=C3=A1nek?= 
Date: Fri, 1 Jan 2021 01:14:47 +0100
Subject: [PATCH] Add pg_stat_progress_copy view with COPY progress report.

---
 doc/src/sgml/monitoring.sgml | 101 +++
 src/backend/catalog/system_views.sql |  11 +++
 src/backend/commands/copyfrom.c  |  16 +++-
 src/backend/commands/copyfromparse.c |   4 +
 src/backend/commands/copyto.c|  21 -
 src/backend/utils/adt/pgstatfuncs.c  |   2 +
 src/include/commands/copyfrom_internal.h |   1 +
 src/include/commands/progress.h  |   5 ++
 src/include/pgstat.h |   3 +-
 src/test/regress/expected/rules.out  |   9 ++
 10 files changed, 168 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3d6c90130677..51d261defd94 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -399,6 +399,12 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
   
  
 
+ 
+  pg_stat_progress_copypg_stat_progress_copy
+  One row for each backend running COPY, showing current progress.
+   See .
+  
+ 
 

   
@@ -5247,6 +5253,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
which support progress reporting are ANALYZE,
CLUSTER,
CREATE INDEX, VACUUM,
+   COPY,
and  (i.e., replication
command that  issues to take
a base backup).
@@ -6396,6 +6403,100 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
   
 
  
+
+ 
+  COPY Progress Reporting
+
+  
+   Whenever COPY is running, the
+   pg_stat_copy_progress view will contain one row
+   for each backend that is currently running COPY command.
+   The table bellow describes the information that will be reported and provide
+   information how to interpret it.
+  
+
+  
+   pg_stat_progress_copy View
+   
+
+ 
+  
+   Column Type
+  
+  
+   Description
+  
+ 
+
+
+
+ 
+  
+   pid integer
+  
+  
+   Process ID of backend.
+  
+ 
+
+ 
+  
+   datid text
+  
+  
+   OID of the database to which this backend is