Re: Parallel copy

2022-03-06 Thread Bharath Rupireddy
On Mon, Dec 28, 2020 at 3:14 PM vignesh C  wrote:
>
> Attached is a patch that was used for the same. The patch is written
> on top of the parallel copy patch.
> The design that Amit, Andres and I voted for, i.e. the leader
> identifying the line boundaries and sharing them in shared memory, is
> performing better.

Hi Hackers, I see the following as some of the open problems with the parallel copy feature:

1) Choosing between the two designs: the leader identifies the
line/tuple boundaries in the file and the workers pick them up and
insert in parallel, vs. the leader only reads the file and the workers
identify the line/tuple boundaries and insert
2) Determining parallel safety of partitioned tables
3) Bulk extension of the relation while inserting, i.e. adding more
than one extra block to the relation in RelationAddExtraBlocks

Please let me know if I'm missing anything.

For (1) - Vignesh's experiments above show that the "leader
identifying the line/tuple boundaries from the file, letting the
workers pick and insert in parallel" design fares better.
For (2) - while it's being discussed in another thread (I'm not sure
what the status of that thread is), how about we take this feature in
without support for partitioned tables, i.e. parallel copy is disabled
for partitioned tables? Once the other discussion reaches a logical
end, we can come back and enable parallel copy for partitioned
tables.
For (3) - we need a way to extend relations, i.e. add new blocks,
quickly - fallocate might help here; I'm not sure who's working on it,
others can comment better here.
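
To illustrate what (3) could look like at the file level, here is a minimal
standalone sketch (not PostgreSQL code; the file name and the 64-block count
are invented) of reserving space for many blocks in one posix_fallocate()
call instead of extending one 8kB block at a time:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192                     /* PostgreSQL's default block size */

int
main(void)
{
    /* "relfile" and the 64-block request are arbitrary for this demo */
    int     fd = open("relfile", O_RDWR | O_CREAT, 0600);
    off_t   old_size;
    int     rc;

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    old_size = lseek(fd, 0, SEEK_END);

    /* one call reserves space for 64 new blocks, instead of 64 separate
     * single-block extensions */
    rc = posix_fallocate(fd, old_size, (off_t) 64 * BLCKSZ);
    if (rc != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));

    close(fd);
    return rc == 0 ? 0 : 1;
}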

Can we take the "parallel copy" feature forward, with some
restrictions in place for now?

Thoughts?

Regards,
Bharath Rupireddy.




Re: Parallel copy

2020-12-28 Thread vignesh C
On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila  wrote:
>
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas  wrote:
> >
> > On 02/11/2020 08:14, Amit Kapila wrote:
> > > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas  
> > > wrote:
> > >>
> > >> In this design, you don't need to keep line boundaries in shared memory,
> > >> because each worker process is responsible for finding the line
> > >> boundaries of its own block.
> > >>
> > >> There's a point of serialization here, in that the next block cannot be
> > >> processed, until the worker working on the previous block has finished
> > >> scanning the EOLs, and set the starting position on the next block,
> > >> putting it in READY state. That's not very different from your patch,
> > >> where you had a similar point of serialization because the leader
> > >> scanned the EOLs,
> > >
> > > But in the design (single producer multiple consumer) used by the
> > > patch the worker doesn't need to wait till the complete block is
> > > processed, it can start processing the lines already found. This will
> > > also allow workers to start much earlier to process the data as it
> > > doesn't need to wait for all the offsets corresponding to 64K block
> > > ready. However, in the design where each worker is processing the 64K
> > > block, it can lead to much longer waits. I think this will impact the
> > > Copy STDIN case more where in most cases (200-300 bytes tuples) we
> > > receive line-by-line from client and find the line-endings by leader.
> > > If the leader doesn't find the line-endings the workers need to wait
> > > till the leader fill the entire 64K chunk, OTOH, with current approach
> > > the worker can start as soon as leader is able to populate some
> > > minimum number of line-endings
> >
> > You can use a smaller block size.
> >
>
> Sure, but the same problem can happen if the last line in that block
> is too long and we need to peek into the next block. And then there
> could be cases where a single line could be greater than 64K.
>
> > However, the point of parallel copy is
> > to maximize bandwidth.
> >
>
> Okay, but this first-phase (finding the line boundaries) can anyway be
> not done in parallel and we have seen in some of the initial
> benchmarking that this initial phase is a small part of work
> especially when the table has indexes, constraints, etc. So, I think
> it won't matter much if this splitting is done in a single process or
> multiple processes.
>

I wrote a patch to compare the performance of the current
implementation (the leader identifying the line boundaries) against
the design where the workers identify the line boundaries. The results
are given below; each tuple reads as (time in seconds with the leader
identifying the line boundaries, time in seconds with the workers
identifying the line boundaries, number of workers).

Use case 1 - 10 million rows, 5.2GB data, 3 indexes on integer columns:
(211.206, 632.583, 1), (165.402, 360.152, 2), (137.608, 219.623, 4),
(128.003, 206.851, 8), (114.518, 177.790, 16), (109.257, 170.058, 20),
(102.050, 158.376, 30)

Use case 2 - 10 million rows, 5.2GB data, 2 indexes on integer columns,
1 index on text column, csv file:
(1212.356, 1602.118, 1), (707.191, 849.105, 2), (369.620, 441.068, 4),
(221.359, 252.775, 8), (167.152, 180.207, 16), (168.804, 181.986, 20),
(172.320, 194.875, 30)

Use case 3 - 10 million rows, 5.2GB data without index:
(96.317, 437.453, 1), (70.730, 240.517, 2), (64.436, 197.604, 4),
(67.186, 175.630, 8), (76.561, 156.015, 16), (81.025, 150.687, 20),
(86.578, 148.481, 30)

Use case 4 - 1 records, 9.6GB, toast data:
(147.076, 276.323, 1), (101.610, 141.893, 2), (100.703, 134.096, 4),
(112.583, 134.765, 8), (101.898, 135.789, 16), (109.258, 135.625, 20),
(109.219, 136.144, 30)

Attached is a patch that was used for the same. The patch is written
on top of the parallel copy patch.
The design that Amit, Andres and I voted for, i.e. the leader
identifying the line boundaries and sharing them in shared memory, is
performing better.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From dd9b6be19573b5391d01373b53e64a5c1dc305fd Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Mon, 28 Dec 2020 15:00:48 +0530
Subject: [PATCH v12 7/7] Parallel copy based on workers identifying line
 boundary.

Parallel copy based on workers identifying line boundary.
---
 src/backend/commands/copyfromparse.c |  93 +++
 src/backend/commands/copyparallel.c  | 441 +--
 src/include/commands/copyfrom_internal.h |  38 ++-
 src/test/reg

Re: Parallel copy

2020-12-26 Thread vignesh C
On Wed, Dec 23, 2020 at 3:05 PM Hou, Zhijie  wrote:
>
> Hi
>
> > Yes, this optimization can be done; I will handle this in the next patch
> > set.
> >
>
> I have a suggestion for the parallel safety-check.
>
> As designed, the leader does not participate in the insertion of data.
> If the user specifies (PARALLEL 1), there is only one worker process, which
> will do the insertion.
>
> IMO, we can skip some of the safety checks in this case, because the
> safety checks are there to limit parallel insert
> (except for temporary tables and the like).
>
> So, how about checking (PARALLEL 1) separately?
> Although it makes things a bit more complicated, (PARALLEL 1) does give a
> good performance improvement.
>

Thanks for the comments, Hou Zhijie. I will run a few tests with 1
worker and try to include this in the next patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




RE: Parallel copy

2020-12-23 Thread Hou, Zhijie
Hi

> Yes, this optimization can be done; I will handle this in the next patch
> set.
> 

I have a suggestion for the parallel safety-check.

As designed, the leader does not participate in the insertion of data.
If the user specifies (PARALLEL 1), there is only one worker process, which
will do the insertion.

IMO, we can skip some of the safety checks in this case, because the safety
checks are there to limit parallel insert (except for temporary tables and
the like).

So, how about checking (PARALLEL 1) separately?
Although it makes things a bit more complicated, (PARALLEL 1) does give a
good performance improvement.

Best regards,
houzj




Re: Parallel copy

2020-12-09 Thread vignesh C
On Mon, Dec 7, 2020 at 3:00 PM Hou, Zhijie  wrote:
>
> > Attached v11 patch has the fix for this, it also includes the changes to
> > rebase on top of head.
>
> Thanks for the explanation.
>
> I think there are still cases where we can know the size.
>
> +    * line_size will be set. Read the line_size again to be sure if it is
> +    * completed or partial block.
> +    */
> +   dataSize = pg_atomic_read_u32(&lineInfo->line_size);
> +   if (dataSize != -1)
> +   {
>
> If I am not wrong, this seems to be the branch that processes the populated block.
> I think we can check copiedSize here: if copiedSize == 0, that means
> dataSize is the size of the whole line, and in this case we can do the
> enlarge.
>
>

Yes, this optimization can be done; I will handle this in the next patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




RE: Parallel copy

2020-12-07 Thread Hou, Zhijie
> > 4.
> > A suggestion for CacheLineInfo.
> >
> > It uses appendBinaryStringXXX to store the line in memory.
> > appendBinaryStringXXX doubles the str memory when there is not enough space.
> >
> > How about calling enlargeStringInfo in advance, if we already know the whole
> > line size?
> > It can avoid some memory waste and may improve performance a little.
> >
> 
> Here we will not know the size beforehand; in some cases we will start
> processing the data when the current block is populated and keep processing
> block by block, and we will come to know the size only at the end. We cannot
> use enlargeStringInfo because of this.
> 
> Attached v11 patch has the fix for this, it also includes the changes to
> rebase on top of head.

Thanks for the explanation.

I think there are still cases where we can know the size.

+    * line_size will be set. Read the line_size again to be sure if it is
+    * completed or partial block.
+    */
+   dataSize = pg_atomic_read_u32(&lineInfo->line_size);
+   if (dataSize != -1)
+   {

If I am not wrong, this seems to be the branch that processes the populated block.
I think we can check copiedSize here: if copiedSize == 0, that means
dataSize is the size of the whole line, and in this case we can do the enlarge.


Best regards,
houzj






RE: Parallel copy

2020-11-19 Thread Hou, Zhijie
Hi Vignesh,

I took a look at the v10 patch set. Here are some comments:

1. 
+/*
+ * CheckExprParallelSafety
+ *
+ * Determine if where cluase and default expressions are parallel safe & do not
+ * have volatile expressions, return true if condition satisfies else return
+ * false.
+ */

'cluase' seems a typo.


2.
+   /*
+    * Make sure that no worker has consumed this element, if this
+    * line is spread across multiple data blocks, worker would have
+    * started processing, no need to change the state to
+    * LINE_LEADER_POPULATING in this case.
+    */
+   (void) pg_atomic_compare_exchange_u32(&lineInfo->line_state,
+                                         &curr_line_state,
+                                         LINE_LEADER_POPULATED);
About the comment:

+* started processing, no need to change the state to
+* LINE_LEADER_POPULATING in this case.

Does it mean 'no need to change the state to LINE_LEADER_POPULATED' here?


3.
+ * 3) only one worker should choose one line for processing, this is handled by
+ *using pg_atomic_compare_exchange_u32, worker will change the state to
+ *LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.

In the latest patch, it will set the state to LINE_WORKER_PROCESSING if
line_state is LINE_LEADER_POPULATED or LINE_LEADER_POPULATING.
So the comment here seems wrong.


4.
A suggestion for CacheLineInfo.

It uses appendBinaryStringXXX to store the line in memory.
appendBinaryStringXXX doubles the str memory when there is not enough space.

How about calling enlargeStringInfo in advance, if we already know the whole
line size?
It can avoid some memory waste and may improve performance a little.
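
To illustrate the suggestion, a minimal standalone sketch in plain C (not
the StringInfo code itself; Buf and the function names are invented)
contrasting grow-by-doubling appends with reserving the known size once up
front:

#include <stdlib.h>
#include <string.h>

typedef struct { char *data; size_t len; size_t maxlen; } Buf;

/* Grow-by-doubling append, similar in spirit to appendBinaryStringInfo:
 * a long line arriving in many small pieces triggers repeated reallocs. */
static void
buf_append(Buf *b, const char *src, size_t n)
{
    while (b->len + n > b->maxlen)
    {
        b->maxlen = b->maxlen ? b->maxlen * 2 : 64;
        b->data = realloc(b->data, b->maxlen);
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
}

/* Analogue of calling enlargeStringInfo in advance: one allocation and
 * no doubling slack, usable when the whole line size is known. */
static void
buf_reserve(Buf *b, size_t total)
{
    if (total > b->maxlen)
    {
        b->maxlen = total;
        b->data = realloc(b->data, b->maxlen);
    }
}

int
main(void)
{
    Buf b = {0};

    buf_reserve(&b, 1000);                    /* known line size: reserve once */
    for (int i = 0; i < 100; i++)
        buf_append(&b, "0123456789", 10);     /* no reallocations happen now */
    free(b.data);
    return 0;
}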


Best regards,
houzj






Re: Parallel copy

2020-11-18 Thread vignesh C
On Sat, Nov 7, 2020 at 7:01 PM vignesh C  wrote:
>
> On Thu, Nov 5, 2020 at 6:33 PM Hou, Zhijie  wrote:
> >
> > Hi
> >
> > >
> > > my $bytes = $ARGV[0];
> > > for(my $i = 0; $i < $bytes; $i+=8){
> > >  print "longdata";
> > > }
> > > print "\n";
> > >
> > > postgres=# copy longdata from program 'perl /tmp/longdata.pl 1'
> > > with (parallel 2);
> > >
> > > This gets stuck forever (or at least I didn't have the patience to wait
> > > for it to finish). Both worker processes are consuming 100% of CPU.
> >
> > I had a look over this problem.
> >
> > the ParallelCopyDataBlock has size limit:
> > uint8   skip_bytes;
> > char    data[DATA_BLOCK_SIZE];  /* data read from file */
> >
> > It seems the input line is so long that the leader process ran out of the
> > shared memory among the parallel copy workers,
> > and the leader process keeps waiting for a free block.
> >
> > For the worker process, it waits until line_state becomes LINE_LEADER_POPULATED,
> > but the leader process won't set the line_state unless it has read the whole line.
> >
> > So it is stuck forever.
> > Maybe we should reconsider this situation.
> >
> > The stack is as follows:
> >
> > Leader stack:
> > #3  0x0075f7a1 in WaitLatch (latch=, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, wait_event_info=wait_event_info@entry=150994945) at latch.c:411
> > #4  0x005a9245 in WaitGetFreeCopyBlock (pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546
> > #5  0x005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, raw_buf_ptr=raw_buf_ptr@entry=65536, copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572
> > #6  0x005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at copy.c:4058
> > #7  0x005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at copy.c:3863
> >
> > Worker stack:
> > #0  GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474
> > #1  0x005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, buff_count=buff_count@entry=0) at copyparallel.c:711
> > #2  0x005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at copyparallel.c:885
> > #3  0x005a4f2e in NextCopyFromRawFields (cstate=cstate@entry=0x29e1f28, fields=fields@entry=0x7fff4cdc0b48, nfields=nfields@entry=0x7fff4cdc0b44) at copy.c:3615
> > #4  0x005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at copy.c:3696
> > #5  0x005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at copy.c:2985
> >
>
> Thanks for providing your thoughts. I have analyzed this issue and I'm
> working on the fix; I will be posting a patch for it shortly.
>

I have fixed and provided a patch for this at [1]
[1]
https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com


Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-11-18 Thread vignesh C
On Sat, Oct 31, 2020 at 2:07 AM Tomas Vondra  wrote:
>
> Hi,
>
> I've done a bit more testing today, and I think the parsing is busted in
> some way. Consider this:
>
>  test=# create extension random;
>  CREATE EXTENSION
>
>  test=# create table t (a text);
>  CREATE TABLE
>
>  test=# insert into t select random_string(random_int(10, 256*1024)) from generate_series(1,1);
>  INSERT 0 1
>
>  test=# copy t to '/mnt/data/t.csv';
>  COPY 1
>
>  test=# truncate t;
>  TRUNCATE TABLE
>
>  test=# copy t from '/mnt/data/t.csv';
>  COPY 1
>
>  test=# truncate t;
>  TRUNCATE TABLE
>
>  test=# copy t from '/mnt/data/t.csv' with (parallel 2);
>  ERROR:  invalid byte sequence for encoding "UTF8": 0x00
>  CONTEXT:  COPY t, line 485: "m&\nh%_a"%r]>qtCl:Q5ltvF~;2oS6@HB
>F>og,bD$Lw'nZY\tYl#BH\t{(j~ryoZ08"SGU~.}8CcTRk1\ts$@U3szCC+U1U3i@P..."
>  parallel worker
>
>
> The functions come from an extension I use to generate random data, I've
> pushed it to github [1]. The random_string() generates a random string
> with ASCII characters, symbols and a couple special characters (\r\n\t).
> The intent was to try loading data where a field may span multiple 64kB
> blocks and may contain newlines etc.
>
> The non-parallel copy works fine, the parallel one fails. I haven't
> investigated the details, but I guess it gets confused about where a
> string starts/end, or something like that.
>

Thanks for identifying this issue; it is fixed in the v10 patch posted
at [1]
[1]
https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com


Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-11-18 Thread vignesh C
On Fri, Nov 13, 2020 at 2:25 PM Amit Kapila  wrote:
>
> On Wed, Nov 11, 2020 at 10:42 PM vignesh C  wrote:
> >
> > On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila  wrote:
> > >
> > > On Tue, Nov 10, 2020 at 7:12 PM vignesh C  wrote:
> > > >
> > > > On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > >
> > > > I have worked to provide a patch for the parallel safety checks. It
> > > > checks whether parallel copy can be performed; parallel copy cannot be
> > > > performed in the following cases: a) the relation is a temporary table,
> > > > b) the relation is a foreign table, c) the relation has non-parallel-safe
> > > > index expressions, d) the relation has triggers of a type other than
> > > > before-statement triggers, e) the relation has check constraints which
> > > > are not parallel safe, f) the relation is partitioned and any partition
> > > > falls under the above cases. This patch has the checks for it and will
> > > > be used by the parallel copy implementation.
> > > >
> > >
> > > How did you ensure that this is sufficient? For parallel-insert's
> > > patch we have enabled parallel-mode for Inserts and ran the tests with
> > > force_parallel_mode to see if we are not missing anything. Also, it
> > > seems there are many common things here w.r.t parallel-insert patch,
> > > is it possible to prepare this atop that patch or do you have any
> > > reason to keep this separate?
> > >
> >
> > I have done similar testing for copy too: I had set force_parallel_mode
> > to regress, hardcoded the code to pick parallel workers for the copy
> > operation, and ran make installcheck-world to verify. Many checks
> > in this patch are common between both patches, but I was not sure how
> > to handle it as both the projects are in-progress and are being
> > updated based on the reviewer's opinion. How to handle this?
> > Thoughts?
> >
>
> I have not studied the differences in detail but if it is possible to
> prepare it on top of that patch then there shouldn't be a problem. To
> avoid confusion if you want you can always either post the latest
> version of that patch with your patch or point to it.
>

I have kept this as a separate patch for now. I will work on seeing
whether I can use Greg's changes as they are; if required, I will provide
a few review comments on top of Greg's patch so that it is usable for
parallel copy too, and later post a separate patch with the changes on
top of it. I will retain it as a separate patch until then.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-11-18 Thread vignesh C
On Thu, Oct 29, 2020 at 2:26 PM Daniel Westermann (DWE)  wrote:
>
> On 27/10/2020 15:36, vignesh C wrote:
> >> Attached v9 patches have the fixes for the above comments.
>
> >I did some testing:
>
> I did some testing as well and have a cosmetic remark:
>
> postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 10);
> ERROR:  value 10 out of bounds for option "parallel"
> DETAIL:  Valid values are between "1" and "1024".
> postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 1000);
> ERROR:  parallel requires an integer value
> postgres=#
>
> Wouldn't it make more sense to only have one error message? The first one 
> seems to be the better message.
>

I had seen similar behavior in other places too:
postgres=# vacuum (parallel 10) t1;
ERROR:  parallel vacuum degree must be between 0 and 1024
LINE 1: vacuum (parallel 10) t1;
^
postgres=# vacuum (parallel 1000) t1;
ERROR:  parallel requires an integer value

I'm not sure if we should fix this.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-11-18 Thread vignesh C
On Wed, Oct 28, 2020 at 5:36 PM Hou, Zhijie  wrote:
>
> Hi
>
> I found some issue in v9-0002
>
> 1.
> +
> +   elog(DEBUG1, "[Worker] Processing - line position:%d, block:%d, unprocessed lines:%d, offset:%d, line size:%d",
> +        write_pos, lineInfo->first_block,
> +        pg_atomic_read_u32(&data_blk_ptr->unprocessed_line_parts),
> +        offset, pg_atomic_read_u32(&lineInfo->line_size));
> +
>
> write_pos and the other variables printed here are of type uint32; I think
> it'd be better to use '%u' in the elog message.
>

Modified it.

> 2.
> +    * line_size will be set. Read the line_size again to be sure if it is
> +    * completed or partial block.
> +    */
> +   dataSize = pg_atomic_read_u32(&lineInfo->line_size);
> +   if (dataSize)
>
> It uses dataSize (type int) to receive a uint32, which seems a little dangerous.
> Would it be better to define dataSize as uint32 here?
>

Modified it.

> 3.
> Since functions with 'Cstate' in the name have been changed to 'CState',
> I think we can change the function PopulateCommonCstateInfo as well.
>

Modified it.

> 4.
> +   if (pcdata->worker_line_buf_count)
>
> I think checks like the above can be 'if (xxx > 0)', which seems
> easier to understand.

Modified it.

Thanks for the comments; these issues are fixed in the v10 patch posted at [1]
[1]
https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-11-18 Thread vignesh C
On Thu, Oct 29, 2020 at 2:20 PM Heikki Linnakangas  wrote:
>
> On 27/10/2020 15:36, vignesh C wrote:
> > Attached v9 patches have the fixes for the above comments.
>
> I did some testing:
>
> /tmp/longdata.pl:
> 
> #!/usr/bin/perl
> #
> # Generate three rows:
> # foo
> # longdatalongdatalongdata...
> # bar
> #
> # The length of the middle row is given as command line arg.
> #
>
> my $bytes = $ARGV[0];
>
> print "foo\n";
> for(my $i = 0; $i < $bytes; $i+=8){
>  print "longdata";
> }
> print "\n";
> print "bar\n";
> 
>
> postgres=# copy longdata from program 'perl /tmp/longdata.pl 1'
> with (parallel 2);
>
> This gets stuck forever (or at least I didn't have the patience to wait
> for it to finish). Both worker processes are consuming 100% of CPU.
>

Thanks for identifying this issue; it is fixed in the v10 patch posted
at [1]
[1]
https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-11-17 Thread Bharath Rupireddy
On Thu, Oct 29, 2020 at 2:54 PM Amit Kapila  wrote:
>
> 4) Worker has to hop through all the processed chunks before getting
> the chunk which it can process.
>
> One more point, I have noticed that some time back [1], I have given
> one suggestion related to the way workers process the set of lines
> (aka chunk). I think you can try by increasing the chunk size to say
> 100, 500, 1000 and use some shared counter to remember the number of
> chunks processed.
>

Hi, I did some analysis comparing a spinlock-protected worker write position
(i.e. each worker acquires a spinlock on a shared write position to choose
the next available chunk) vs. each worker hopping to find the next available
chunk position:

Use Case: 10 million rows, 5.6GB data, 2 indexes on integer columns, 1 index
on text column; results are of the form (no of workers, total exec time in
sec, index insertion time in sec, worker write pos get time in sec, buffer
contention event count):

With spinlock:
(1,1126.443,1060.067,0.478,*0*), (2,669.343,630.769,0.306,*26*),
(4,346.297,326.950,0.161,*89*), (8,209.600,196.417,0.088,*291*),
(16,166.113,157.086,0.065,*1468*), (20,173.884,166.013,0.067,*2700*),
(30,173.087,1166.565,0.0065,*5346*)
Without spinlock:
(1,1119.695,1054.586,0.496,*0*), (2,645.733,608.313,1.5,*8*),
(4,340.620,320.344,1.6,*58*), (8,203.985,189.644,1.3,*222*),
(16,142.997,133.045,1,*813*), (20,132.621,122.527,1.1,*1215*),
(30,135.737,126.716,1.5,*2901*)

With the spinlock, each worker gets the required write position quickly and
proceeds until the index insertion (which becomes a single point of
contention), where we observed more buffer lock contention. The reason is
that all the workers reach the index insertion point at around the same
time.

Without the spinlock, each worker spends some time hopping to get the write
position, during which the other workers are inserting into the indexes. So
the workers do not all reach the index insertion point at the same time, and
hence there is less buffer lock contention.

The same behaviour (explained above) is observed with different worker chunk
counts (default 64, 128, 512 and 1024), i.e. the number of tuples each worker
caches into its local memory before inserting into the table.

In summary: with the spinlock, it looks like we are able to avoid workers
waiting to get the next chunk, which also means that we are not creating any
contention point inside the parallel copy code. However, this causes another
choking point, i.e. index insertion if indexes are available on the table,
which is out of scope of the parallel copy code. We think that it would be
good to use a spinlock-protected worker write position, or an atomic variable
for the worker write position (as it performs equal to the spinlock, or a
little better on some platforms). Thoughts?
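
To make the two schemes concrete, here is a minimal standalone sketch (C
with pthreads and C11 atomics; the chunk counts and names are invented, and
this is not the patch code). claim_fast() models the shared write position
via an atomic fetch-add (a spinlock around a plain counter behaves
equivalently), while claim_hop() models hopping over already-claimed chunks:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NCHUNKS  1024
#define NWORKERS 4

static atomic_uint next_chunk;          /* shared worker write position */
static atomic_int  claimed[NCHUNKS];    /* per-chunk "taken" flags */

/* Scheme 1: one atomic fetch-add hands out the next chunk directly */
static int
claim_fast(void)
{
    unsigned int c = atomic_fetch_add(&next_chunk, 1);

    return c < NCHUNKS ? (int) c : -1;
}

/* Scheme 2: hop forward until an unclaimed chunk is found */
static int
claim_hop(void)
{
    for (int c = 0; c < NCHUNKS; c++)
    {
        int     expected = 0;

        if (atomic_compare_exchange_strong(&claimed[c], &expected, 1))
            return c;
    }
    return -1;
}

static void *
worker(void *arg)
{
    int     n = 0;

    (void) arg;
    /* swap claim_fast() for claim_hop() to compare the two schemes */
    while (claim_fast() >= 0)
        n++;
    printf("worker claimed %d chunks\n", n);
    return NULL;
}

int
main(void)
{
    pthread_t   t[NWORKERS];

    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);
    return 0;
}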

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-11-13 Thread Amit Kapila
On Wed, Nov 11, 2020 at 10:42 PM vignesh C  wrote:
>
> On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila  wrote:
> >
> > On Tue, Nov 10, 2020 at 7:12 PM vignesh C  wrote:
> > >
> > > On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila  
> > > wrote:
> > > >
> > >
> > > I have worked to provide a patch for the parallel safety checks. It
> > > checks whether parallel copy can be performed; parallel copy cannot be
> > > performed in the following cases: a) the relation is a temporary table,
> > > b) the relation is a foreign table, c) the relation has non-parallel-safe
> > > index expressions, d) the relation has triggers of a type other than
> > > before-statement triggers, e) the relation has check constraints which
> > > are not parallel safe, f) the relation is partitioned and any partition
> > > falls under the above cases. This patch has the checks for it and will
> > > be used by the parallel copy implementation.
> > >
> >
> > How did you ensure that this is sufficient? For parallel-insert's
> > patch we have enabled parallel-mode for Inserts and ran the tests with
> > force_parallel_mode to see if we are not missing anything. Also, it
> > seems there are many common things here w.r.t parallel-insert patch,
> > is it possible to prepare this atop that patch or do you have any
> > reason to keep this separate?
> >
>
> I have done similar testing for copy too: I had set force_parallel_mode
> to regress, hardcoded the code to pick parallel workers for the copy
> operation, and ran make installcheck-world to verify. Many checks
> in this patch are common between both patches, but I was not sure how
> to handle it as both the projects are in-progress and are being
> updated based on the reviewer's opinion. How to handle this?
> Thoughts?
>

I have not studied the differences in detail but if it is possible to
prepare it on top of that patch then there shouldn't be a problem. To
avoid confusion if you want you can always either post the latest
version of that patch with your patch or point to it.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-11-11 Thread vignesh C
On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila  wrote:
>
> On Tue, Nov 10, 2020 at 7:12 PM vignesh C  wrote:
> >
> > On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila  wrote:
> > >
> >
> > I have worked to provide a patch for the parallel safety checks. It
> > checks whether parallel copy can be performed; parallel copy cannot be
> > performed in the following cases: a) the relation is a temporary table,
> > b) the relation is a foreign table, c) the relation has non-parallel-safe
> > index expressions, d) the relation has triggers of a type other than
> > before-statement triggers, e) the relation has check constraints which
> > are not parallel safe, f) the relation is partitioned and any partition
> > falls under the above cases. This patch has the checks for it and will
> > be used by the parallel copy implementation.
> >
>
> How did you ensure that this is sufficient? For parallel-insert's
> patch we have enabled parallel-mode for Inserts and ran the tests with
> force_parallel_mode to see if we are not missing anything. Also, it
> seems there are many common things here w.r.t parallel-insert patch,
> is it possible to prepare this atop that patch or do you have any
> reason to keep this separate?
>

I have done similar testing for copy too: I had set force_parallel_mode
to regress, hardcoded the code to pick parallel workers for the copy
operation, and ran make installcheck-world to verify. Many checks
in this patch are common between both patches, but I was not sure how
to handle it as both the projects are in-progress and are being
updated based on the reviewer's opinion. How to handle this?
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-11-10 Thread Amit Kapila
On Tue, Nov 10, 2020 at 7:12 PM vignesh C  wrote:
>
> On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila  wrote:
> >
>
> I have worked to provide a patch for the parallel safety checks. It
> checks whether parallel copy can be performed; parallel copy cannot be
> performed in the following cases: a) the relation is a temporary table,
> b) the relation is a foreign table, c) the relation has non-parallel-safe
> index expressions, d) the relation has triggers of a type other than
> before-statement triggers, e) the relation has check constraints which
> are not parallel safe, f) the relation is partitioned and any partition
> falls under the above cases. This patch has the checks for it and will
> be used by the parallel copy implementation.
>

How did you ensure that this is sufficient? For parallel-insert's
patch we have enabled parallel-mode for Inserts and ran the tests with
force_parallel_mode to see if we are not missing anything. Also, it
seems there are many common things here w.r.t parallel-insert patch,
is it possible to prepare this atop that patch or do you have any
reason to keep this separate?

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-11-10 Thread vignesh C
On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila  wrote:
>
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas  wrote:
> >
> > On 02/11/2020 08:14, Amit Kapila wrote:
> > > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas  
> > > wrote:
> > >>
> > >> In this design, you don't need to keep line boundaries in shared memory,
> > >> because each worker process is responsible for finding the line
> > >> boundaries of its own block.
> > >>
> > >> There's a point of serialization here, in that the next block cannot be
> > >> processed, until the worker working on the previous block has finished
> > >> scanning the EOLs, and set the starting position on the next block,
> > >> putting it in READY state. That's not very different from your patch,
> > >> where you had a similar point of serialization because the leader
> > >> scanned the EOLs,
> > >
> > > But in the design (single producer multiple consumer) used by the
> > > patch the worker doesn't need to wait till the complete block is
> > > processed, it can start processing the lines already found. This will
> > > also allow workers to start much earlier to process the data as it
> > > doesn't need to wait for all the offsets corresponding to 64K block
> > > ready. However, in the design where each worker is processing the 64K
> > > block, it can lead to much longer waits. I think this will impact the
> > > Copy STDIN case more where in most cases (200-300 bytes tuples) we
> > > receive line-by-line from client and find the line-endings by leader.
> > > If the leader doesn't find the line-endings the workers need to wait
> > > till the leader fill the entire 64K chunk, OTOH, with current approach
> > > the worker can start as soon as leader is able to populate some
> > > minimum number of line-endings
> >
> > You can use a smaller block size.
> >
>
> Sure, but the same problem can happen if the last line in that block
> is too long and we need to peek into the next block. And then there
> could be cases where a single line could be greater than 64K.
>
> > However, the point of parallel copy is
> > to maximize bandwidth.
> >
>
> Okay, but this first-phase (finding the line boundaries) can anyway be
> not done in parallel and we have seen in some of the initial
> benchmarking that this initial phase is a small part of work
> especially when the table has indexes, constraints, etc. So, I think
> it won't matter much if this splitting is done in a single process or
> multiple processes.
>
> > If the workers ever have to sit idle, it means
> > that the bottleneck is in receiving data from the client, i.e. the
> > backend is fast enough, and you can't make the overall COPY finish any
> > faster no matter how you do it.
> >
> > > The other point is that the leader backend won't be used completely as
> > > it is only doing a very small part (primarily reading the file) of the
> > > overall work.
> >
> > An idle process doesn't cost anything. If you have free CPU resources,
> > use more workers.
> >
> > > We have discussed both these approaches (a) single producer multiple
> > > consumer, and (b) all workers doing the processing as you are saying
> > > in the beginning and concluded that (a) is better, see some of the
> > > relevant emails [1][2][3].
> > >
> > > [1] - 
> > > https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
> > > [2] - 
> > > https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
> > > [3] - 
> > > https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de
> >
> > Sorry I'm late to the party. I don't think the design I proposed was
> > discussed in that threads.
> >
>
> I think something close to that is discussed as you have noticed in
> your next email but IIRC, because many people (Andres, Ants, myself
> and author) favoured the current approach (single reader and multiple
> consumers) we decided to go with that. I feel this patch is very much
> in the POC stage due to which the code doesn't look good and as we
> move forward we need to see what is the better way to improve it,
> maybe one of the ways is to split it as you are suggesting so that it
> can be easier to review. I think the other important thing which this
> patch has not addressed properly is the parallel-safety checks as
> pointed by me earlier. There are two things to solve there (a) the

Re: Parallel copy

2020-11-07 Thread vignesh C
On Thu, Nov 5, 2020 at 6:33 PM Hou, Zhijie  wrote:
>
> Hi
>
> >
> > my $bytes = $ARGV[0];
> > for(my $i = 0; $i < $bytes; $i+=8){
> >  print "longdata";
> > }
> > print "\n";
> > 
> >
> > postgres=# copy longdata from program 'perl /tmp/longdata.pl 1'
> > with (parallel 2);
> >
> > This gets stuck forever (or at least I didn't have the patience to wait
> > for it to finish). Both worker processes are consuming 100% of CPU.
>
> I had a look over this problem.
>
> the ParallelCopyDataBlock has size limit:
> uint8   skip_bytes;
> char    data[DATA_BLOCK_SIZE];  /* data read from file */
>
> It seems the input line is so long that the leader process ran out of the
> shared memory among the parallel copy workers,
> and the leader process keeps waiting for a free block.
>
> For the worker process, it waits until line_state becomes LINE_LEADER_POPULATED,
> but the leader process won't set the line_state unless it has read the whole line.
>
> So it is stuck forever.
> Maybe we should reconsider this situation.
>
> The stack is as follows:
>
> Leader stack:
> #3  0x0075f7a1 in WaitLatch (latch=, 
> wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, 
> wait_event_info=wait_event_info@entry=150994945) at latch.c:411
> #4  0x005a9245 in WaitGetFreeCopyBlock 
> (pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546
> #5  0x005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, 
> line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, 
> raw_buf_ptr=raw_buf_ptr@entry=65536,
> copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572
> #6  0x005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at 
> copy.c:4058
> #7  0x005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at 
> copy.c:3863
>
> Worker stack:
> #0  GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474
> #1  0x005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, 
> buff_count=buff_count@entry=0) at copyparallel.c:711
> #2  0x005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at 
> copyparallel.c:885
> #3  0x005a4f2e in NextCopyFromRawFields 
> (cstate=cstate@entry=0x29e1f28, fields=fields@entry=0x7fff4cdc0b48, 
> nfields=nfields@entry=0x7fff4cdc0b44) at copy.c:3615
> #4  0x005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, 
> econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at 
> copy.c:3696
> #5  0x005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at 
> copy.c:2985
>

Thanks for providing your thoughts. I have analyzed this issue and I'm
working on the fix; I will be posting a patch for it shortly.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




RE: Parallel copy

2020-11-05 Thread Hou, Zhijie
Hi

> 
> my $bytes = $ARGV[0];
> for(my $i = 0; $i < $bytes; $i+=8){
>  print "longdata";
> }
> print "\n";
> 
> 
> postgres=# copy longdata from program 'perl /tmp/longdata.pl 1'
> with (parallel 2);
> 
> This gets stuck forever (or at least I didn't have the patience to wait
> for it to finish). Both worker processes are consuming 100% of CPU.

I had a look over this problem.

the ParallelCopyDataBlock has size limit:
uint8   skip_bytes;
char    data[DATA_BLOCK_SIZE];  /* data read from file */

It seems the input line is so long that the leader process ran out of the
shared memory among the parallel copy workers,
and the leader process keeps waiting for a free block.

For the worker process, it waits until line_state becomes LINE_LEADER_POPULATED,
but the leader process won't set the line_state unless it has read the whole line.

So it is stuck forever.
Maybe we should reconsider this situation.

The stack is as follows:

Leader stack:
#3  0x0075f7a1 in WaitLatch (latch=, 
wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, 
wait_event_info=wait_event_info@entry=150994945) at latch.c:411
#4  0x005a9245 in WaitGetFreeCopyBlock 
(pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546
#5  0x005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, 
line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, 
raw_buf_ptr=raw_buf_ptr@entry=65536, 
copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572
#6  0x005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at 
copy.c:4058
#7  0x005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at 
copy.c:3863

Worker stack:
#0  GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474
#1  0x005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, 
buff_count=buff_count@entry=0) at copyparallel.c:711
#2  0x005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at 
copyparallel.c:885
#3  0x005a4f2e in NextCopyFromRawFields (cstate=cstate@entry=0x29e1f28, 
fields=fields@entry=0x7fff4cdc0b48, nfields=nfields@entry=0x7fff4cdc0b44) at 
copy.c:3615
#4  0x005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, 
econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at 
copy.c:3696
#5  0x005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at 
copy.c:2985


Best regards,
houzj





Re: Parallel copy

2020-11-03 Thread Heikki Linnakangas

On 03/11/2020 10:59, Amit Kapila wrote:

On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas  wrote:

However, the point of parallel copy is to maximize bandwidth.


Okay, but this first-phase (finding the line boundaries) can anyway
be not done in parallel and we have seen in some of the initial 
benchmarking that this initial phase is a small part of work 
especially when the table has indexes, constraints, etc. So, I think 
it won't matter much if this splitting is done in a single process
or multiple processes.
Right, it won't matter performance-wise. That's not my point. The 
difference is in the complexity. If you don't store the line boundaries 
in shared memory, you get away with much simpler shared memory structures.



I think something close to that is discussed as you have noticed in
your next email but IIRC, because many people (Andres, Ants, myself
and author) favoured the current approach (single reader and multiple
consumers) we decided to go with that. I feel this patch is very much
in the POC stage due to which the code doesn't look good and as we
move forward we need to see what is the better way to improve it,
maybe one of the ways is to split it as you are suggesting so that it
can be easier to review.


Sure. I think the roadmap here is:

1. Split copy.c [1]. Not strictly necessary, but I think it'd make this
nicer to review and work with.


2. Refactor CopyReadLine(), so that finding the line-endings and the 
rest of the line-parsing are separated into separate functions.


3. Implement parallel copy.


I think the other important thing which this
patch has not addressed properly is the parallel-safety checks as
pointed by me earlier. There are two things to solve there (a) the
lower-level code (like heap_* APIs, CommandCounterIncrement, xact.c
APIs, etc.) has checks which don't allow any writes, we need to see
which of those we can open now (or do some additional work to prevent
from those checks) after some of the work done for parallel-writes in
PG-13[1][2], and (b) in which all cases we can parallel-writes
(parallel copy) is allowed, for example need to identify whether table
or one of its partitions has any constraint/expression which is
parallel-unsafe.


Agreed, that needs to be solved. I haven't given it any thought myself.

- Heikki

[1] 
https://www.postgresql.org/message-id/8e15b560-f387-7acc-ac90-763986617bfb%40iki.fi





Re: Parallel copy

2020-11-03 Thread Amit Kapila
On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas  wrote:
>
> On 02/11/2020 08:14, Amit Kapila wrote:
> > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas  wrote:
> >>
> >> In this design, you don't need to keep line boundaries in shared memory,
> >> because each worker process is responsible for finding the line
> >> boundaries of its own block.
> >>
> >> There's a point of serialization here, in that the next block cannot be
> >> processed, until the worker working on the previous block has finished
> >> scanning the EOLs, and set the starting position on the next block,
> >> putting it in READY state. That's not very different from your patch,
> >> where you had a similar point of serialization because the leader
> >> scanned the EOLs,
> >
> > But in the design (single producer multiple consumer) used by the
> > patch the worker doesn't need to wait till the complete block is
> > processed, it can start processing the lines already found. This will
> > also allow workers to start much earlier to process the data as it
> > doesn't need to wait for all the offsets corresponding to 64K block
> > ready. However, in the design where each worker is processing the 64K
> > block, it can lead to much longer waits. I think this will impact the
> > Copy STDIN case more where in most cases (200-300 bytes tuples) we
> > receive line-by-line from client and find the line-endings by leader.
> > If the leader doesn't find the line-endings the workers need to wait
> > till the leader fill the entire 64K chunk, OTOH, with current approach
> > the worker can start as soon as leader is able to populate some
> > minimum number of line-endings
>
> You can use a smaller block size.
>

Sure, but the same problem can happen if the last line in that block
is too long and we need to peek into the next block. And then there
could be cases where a single line could be greater than 64K.

> However, the point of parallel copy is
> to maximize bandwidth.
>

Okay, but this first-phase (finding the line boundaries) can anyway be
not done in parallel and we have seen in some of the initial
benchmarking that this initial phase is a small part of work
especially when the table has indexes, constraints, etc. So, I think
it won't matter much if this splitting is done in a single process or
multiple processes.

> If the workers ever have to sit idle, it means
> that the bottleneck is in receiving data from the client, i.e. the
> backend is fast enough, and you can't make the overall COPY finish any
> faster no matter how you do it.
>
> > The other point is that the leader backend won't be used completely as
> > it is only doing a very small part (primarily reading the file) of the
> > overall work.
>
> An idle process doesn't cost anything. If you have free CPU resources,
> use more workers.
>
> > We have discussed both these approaches (a) single producer multiple
> > consumer, and (b) all workers doing the processing as you are saying
> > in the beginning and concluded that (a) is better, see some of the
> > relevant emails [1][2][3].
> >
> > [1] - 
> > https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
> > [2] - 
> > https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
> > [3] - 
> > https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de
>
> Sorry I'm late to the party. I don't think the design I proposed was
> discussed in that threads.
>

I think something close to that is discussed as you have noticed in
your next email but IIRC, because many people (Andres, Ants, myself
and author) favoured the current approach (single reader and multiple
consumers) we decided to go with that. I feel this patch is very much
in the POC stage due to which the code doesn't look good and as we
move forward we need to see what is the better way to improve it,
maybe one of the ways is to split it as you are suggesting so that it
can be easier to review. I think the other important thing which this
patch has not addressed properly is the parallel-safety checks as
pointed by me earlier. There are two things to solve there (a) the
lower-level code (like heap_* APIs, CommandCounterIncrement, xact.c
APIs, etc.) has checks which don't allow any writes, we need to see
which of those we can open now (or do some additional work to prevent
from those checks) after some of the work done for parallel-writes in
PG-13[1][2], and (b) in which all cases we can parallel-writes
(parallel copy) is allowed, for example need to identify whether table
or one of its partitions has any constraint/expression which is
parallel-unsafe.

Re: Parallel copy

2020-11-01 Thread Heikki Linnakangas

On 02/11/2020 09:10, Heikki Linnakangas wrote:

On 02/11/2020 08:14, Amit Kapila wrote:

We have discussed both these approaches (a) single producer multiple
consumer, and (b) all workers doing the processing as you are saying
in the beginning and concluded that (a) is better, see some of the
relevant emails [1][2][3].

[1] - 
https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
[2] - 
https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
[3] - 
https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de


Sorry I'm late to the party. I don't think the design I proposed was
discussed in that threads. The alternative that's discussed in that
thread seems to be something much more fine-grained, where processes
claim individual lines. I'm not sure though, I didn't fully understand
the alternative designs.


I read the thread more carefully, and I think Robert had basically the 
right idea here 
(https://www.postgresql.org/message-id/CA%2BTgmoZMU4az9MmdJtg04pjRa0wmWQtmoMxttdxNrupYJNcR3w%40mail.gmail.com):



I really think we don't want a single worker in charge of finding
tuple boundaries for everybody. That adds a lot of unnecessary
inter-process communication and synchronization. Each process should
just get the next tuple starting after where the last one ended, and
then advance the end pointer so that the next process can do the same
thing. [...]


And here 
(https://www.postgresql.org/message-id/CA%2BTgmoZw%2BF3y%2BoaxEsHEZBxdL1x1KAJ7pRMNgCqX0WjmjGNLrA%40mail.gmail.com):



On Thu, Apr 9, 2020 at 2:55 PM Andres Freund  wrote:

I'm fairly certain that we do *not* want to distribute input data
between processes on a single tuple basis. Probably not even below a few
hundred kb. If there's any sort of natural clustering in the loaded data
- extremely common, think timestamps - splitting on a granular basis
will make indexing much more expensive. And have a lot more contention.


That's a fair point. I think the solution ought to be that once any
process starts finding line endings, it continues until it's grabbed
at least a certain amount of data for itself. Then it stops and lets
some other process grab a chunk of data.
Yes! That's pretty close to the design I sketched. I imagined that the 
leader would divide the input into 64 kB blocks, and each block would 
have  few metadata fields, notably the starting position of the first 
line in the block. I think Robert envisioned having a single "next 
starting position" field in shared memory. That works too, and is even 
simpler, so +1 for that.
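
To make that concrete, here is a minimal standalone sketch (C11 atomics;
all names are invented for illustration, and real CSV input would also
need quote tracking): a worker loads the shared position, extends its
claim to at least a minimum amount of data ending on a line boundary,
and compare-and-swaps the position forward so the next worker continues
from there:

#include <stdatomic.h>
#include <string.h>

#define MIN_CLAIM 65536                 /* grab at least ~64 kB per claim */

static atomic_size_t next_start;        /* shared "next starting position" */

/* Claim a range of whole lines from buf (length len), starting at the
 * shared position and ending on a line boundary. Returns the claimed
 * length and sets *start_out, or returns 0 when the input is exhausted. */
static size_t
claim_lines(const char *buf, size_t len, size_t *start_out)
{
    for (;;)
    {
        size_t  start = atomic_load(&next_start);
        size_t  end;

        if (start >= len)
            return 0;

        /* take at least MIN_CLAIM bytes, then extend to the end of the
         * current line so no tuple is ever split between workers */
        end = start + MIN_CLAIM;
        if (end >= len)
            end = len;
        else
        {
            const char *nl = memchr(buf + end, '\n', len - end);

            end = nl ? (size_t) (nl - buf) + 1 : len;
        }

        /* advance the shared position; if the CAS fails, another worker
         * won the race and we retry from the new position */
        if (atomic_compare_exchange_weak(&next_start, &start, end))
        {
            *start_out = start;
            return end - start;
        }
    }
}

int
main(void)
{
    const char *input = "a,1\nb,2\nc,3\n"; /* toy stand-in for the input */
    size_t      start;

    while (claim_lines(input, strlen(input), &start) > 0)
    {
        /* parse and insert the lines in the claimed range here */
    }
    return 0;
}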


For some reason, the discussion took a different turn from there, to 
discuss how the line-endings (called "chunks" in the discussion) should 
be represented in shared memory. But none of that is necessary with 
Robert's design.


- Heikki




Re: Parallel copy

2020-11-01 Thread Heikki Linnakangas

On 02/11/2020 08:14, Amit Kapila wrote:

On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas  wrote:


Leader process:

The leader process is simple. It picks the next FREE buffer, fills it
with raw data from the file, and marks it as FILLED. If no buffers are
FREE, wait.

Worker process:

1. Claim next READY block from queue, by changing its state to
 PROCESSING. If the next block is not READY yet, wait until it is.

2. Start scanning the block from 'startpos', finding end-of-line
 markers. (in CSV mode, need to track when we're in-quotes).

3. When you reach the end of the block, if the last line continues to
 next block, wait for the next block to become FILLED. Peek into the
 next block, and copy the remaining part of the split line to a local
 buffer, and set the 'startpos' on the next block to point to the end
 of the split line. Mark the next block as READY.

4. Process all the lines in the block, call input functions, insert
 rows.

5. Mark the block as DONE.

In this design, you don't need to keep line boundaries in shared memory,
because each worker process is responsible for finding the line
boundaries of its own block.
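
Sketching that state machine in code (standalone C with C11 atomics; the
struct layout and helper are invented for illustration, not taken from any
patch): the leader moves blocks FREE -> FILLED, the worker on the previous
block moves FILLED -> READY by publishing 'startpos', and workers claim
blocks READY -> PROCESSING with a compare-and-swap:

#include <stdatomic.h>
#include <stddef.h>

typedef enum BlockState
{
    BLK_FREE,                   /* leader may fill with raw input */
    BLK_FILLED,                 /* raw data present, startpos not yet known */
    BLK_READY,                  /* startpos set by previous block's worker */
    BLK_PROCESSING,             /* claimed by a worker */
    BLK_DONE                    /* all lines inserted; recycled as FREE */
} BlockState;

typedef struct Block
{
    _Atomic BlockState state;
    size_t      startpos;       /* offset of first whole line in data */
    size_t      nbytes;
    char        data[65536];
} Block;

/* Worker step 1 of the design above: claim the next block by atomically
 * moving it READY -> PROCESSING; on failure the caller must wait for the
 * leader (FREE -> FILLED) or for the previous block's worker to publish
 * startpos (FILLED -> READY), then retry. */
static Block *
try_claim_block(Block *ring, int nblocks, int cursor)
{
    Block      *b = &ring[cursor % nblocks];
    BlockState  expected = BLK_READY;

    if (atomic_compare_exchange_strong(&b->state, &expected, BLK_PROCESSING))
        return b;
    return NULL;                /* not ready yet: wait and retry */
}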

There's a point of serialization here, in that the next block cannot be
processed, until the worker working on the previous block has finished
scanning the EOLs, and set the starting position on the next block,
putting it in READY state. That's not very different from your patch,
where you had a similar point of serialization because the leader
scanned the EOLs,


But in the design (single producer multiple consumer) used by the
patch the worker doesn't need to wait till the complete block is
processed, it can start processing the lines already found. This will
also allow workers to start much earlier to process the data as it
doesn't need to wait for all the offsets corresponding to 64K block
ready. However, in the design where each worker is processing the 64K
block, it can lead to much longer waits. I think this will impact the
Copy STDIN case more where in most cases (200-300 bytes tuples) we
receive line-by-line from client and find the line-endings by leader.
If the leader doesn't find the line-endings the workers need to wait
till the leader fill the entire 64K chunk, OTOH, with current approach
the worker can start as soon as leader is able to populate some
minimum number of line-endings


You can use a smaller block size. However, the point of parallel copy is 
to maximize bandwidth. If the workers ever have to sit idle, it means 
that the bottleneck is in receiving data from the client, i.e. the 
backend is fast enough, and you can't make the overall COPY finish any 
faster no matter how you do it.



The other point is that the leader backend won't be used completely as
it is only doing a very small part (primarily reading the file) of the
overall work.


An idle process doesn't cost anything. If you have free CPU resources, 
use more workers.



We have discussed both these approaches (a) single producer multiple
consumer, and (b) all workers doing the processing as you are saying
in the beginning and concluded that (a) is better, see some of the
relevant emails [1][2][3].

[1] - 
https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
[2] - 
https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
[3] - 
https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de


Sorry I'm late to the party. I don't think the design I proposed was 
discussed in that threads. The alternative that's discussed in that 
thread seems to be something much more fine-grained, where processes 
claim individual lines. I'm not sure though, I didn't fully understand 
the alternative designs.


I want to throw out one more idea. It's an interim step, not the final
solution we want, but a useful step in getting there:


Have the leader process scan the input for line-endings. Split the input 
data into blocks of slightly under 64 kB in size, so that a line never 
crosses a block. Put the blocks in shared memory.
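
A small sketch of that splitting rule (standalone C; the constant and the
function are invented for illustration), where each emitted block ends just
after a newline so that no line crosses a block:

#include <stddef.h>

#define BLOCK_LIMIT 65536   /* "slightly under 64 kB": last newline wins */

/* Given the remaining input, return the length of the next block such
 * that it ends just after a newline. Assumes a line always fits in one
 * block; a real implementation must handle longer lines (and CSV quotes). */
static size_t
next_block_len(const char *in, size_t len)
{
    if (len <= BLOCK_LIMIT)
        return len;
    for (size_t i = BLOCK_LIMIT; i > 0; i--)   /* scan back to a newline */
        if (in[i - 1] == '\n')
            return i;
    return 0;   /* line longer than a block: caller must deal with it */
}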


A worker process claims a block from shared memory, processes it from 
beginning to end. It *also* has to parse the input to split it into lines.


In this design, the line-splitting is done twice. That's clearly not 
optimal, and we want to avoid that in the final patch, but I think it 
would be a useful milestone. After that patch is done, write another 
patch to either a) implement the design I sketched, where blocks are 
fixed-size and a worker notifies the next worker on where the first line 
in next block begins, or b) have the leader process report the 
line-ending positions in shared memory, so that workers don't need to 
scan them again.


Even if we apply the patches together, I think splitting them like that 
would make for easier review.


- Heikki




Re: Parallel copy

2020-11-01 Thread Amit Kapila
On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas  wrote:
>
> Leader process:
>
> The leader process is simple. It picks the next FREE buffer, fills it
> with raw data from the file, and marks it as FILLED. If no buffers are
> FREE, wait.
>
> Worker process:
>
> 1. Claim next READY block from queue, by changing its state to
> PROCESSING. If the next block is not READY yet, wait until it is.
>
> 2. Start scanning the block from 'startpos', finding end-of-line
> markers. (in CSV mode, need to track when we're in-quotes).
>
> 3. When you reach the end of the block, if the last line continues to
> next block, wait for the next block to become FILLED. Peek into the
> next block, and copy the remaining part of the split line to a local
> buffer, and set the 'startpos' on the next block to point to the end
> of the split line. Mark the next block as READY.
>
> 4. Process all the lines in the block, call input functions, insert
> rows.
>
> 5. Mark the block as DONE.
>
> In this design, you don't need to keep line boundaries in shared memory,
> because each worker process is responsible for finding the line
> boundaries of its own block.
>
> There's a point of serialization here, in that the next block cannot be
> processed, until the worker working on the previous block has finished
> scanning the EOLs, and set the starting position on the next block,
> putting it in READY state. That's not very different from your patch,
> where you had a similar point of serialization because the leader
> scanned the EOLs,
>

But in the design (single producer multiple consumer) used by the
patch the worker doesn't need to wait till the complete block is
processed; it can start processing the lines already found. This will
also allow workers to start much earlier to process the data as they
don't need to wait for all the offsets corresponding to the 64K block
to be ready. However, in the design where each worker is processing the
64K block, it can lead to much longer waits. I think this will impact the
Copy STDIN case more where in most cases (200-300 byte tuples) we
receive line-by-line from the client and the line-endings are found by
the leader. If the leader doesn't find the line-endings, the workers need
to wait till the leader fills the entire 64K chunk; OTOH, with the current
approach the worker can start as soon as the leader is able to populate
some minimum number of line-endings.

The other point is that the leader backend won't be fully utilized, as
it is only doing a very small part (primarily reading the file) of the
overall work.

We have discussed both these approaches (a) single producer multiple
consumer, and (b) all workers doing the processing as you are saying
in the beginning and concluded that (a) is better, see some of the
relevant emails [1][2][3].

[1] - 
https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
[2] - 
https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
[3] - 
https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-31 Thread Tomas Vondra

On Sat, Oct 31, 2020 at 12:09:32AM +0200, Heikki Linnakangas wrote:

On 30/10/2020 22:56, Tomas Vondra wrote:

I agree this design looks simpler. I'm a bit worried about serializing
the parsing like this, though. It's true the current approach (where the
first phase of parsing happens in the leader) has a similar issue, but I
think it would be easier to improve that in that design.

My plan was to parallelize the parsing roughly like this:

1) split the input buffer into smaller chunks

2) let workers scan the buffers and record positions of interesting
characters (delimiters, quotes, ...) and pass it back to the leader

3) use the information to actually parse the input data (we only need to
look at the interesting characters, skipping large parts of data)

4) pass the parsed chunks to workers, just like in the current patch


But maybe something like that would be possible even with the approach
you propose - we could have a special parse phase for processing each
buffer, where any worker could look for the special characters, record
the positions in a bitmap next to the buffer. So the whole sequence of
states would look something like this:

 EMPTY
 FILLED
 PARSED
 READY
 PROCESSING


I think it's even simpler than that. You don't need to communicate the 
"interesting positions" between processes, if the same worker takes 
care of the chunk through all states from FILLED to DONE.


You can build the bitmap of interesting positions immediately in 
FILLED state, independently of all previous blocks. Once you've built 
the bitmap, you need to wait for the information on where the first 
line starts, but presumably finding the interesting positions is the 
expensive part.




I don't think it's that simple. For example, the previous block may
contain a very long value (say, 1MB), so a bunch of blocks have to be
processed by the same worker. That probably complicates the state
transitions a bit, and it also means the bitmap would need to be passed
to the worker that actually processes the block. Or we might just ignore
this, on the grounds that it's not a very common situation.



Of course, the question is whether parsing really is sufficiently
expensive for this to be worth it.


Yeah, I don't think it's worth it. Splitting the lines is pretty fast, 
I think we have many years to come before that becomes a bottleneck. 
But if it turns out I'm wrong and we need to implement that, the path 
is pretty straightforward.




OK. I agree the parsing is relatively cheap, and I don't recall seeing
CSV parsing as a bottleneck in production.  I suspect that might be
simply because we're hitting other bottlenecks first, though.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Parallel copy

2020-10-30 Thread Heikki Linnakangas

On 30/10/2020 22:56, Tomas Vondra wrote:

I agree this design looks simpler. I'm a bit worried about serializing
the parsing like this, though. It's true the current approach (where the
first phase of parsing happens in the leader) has a similar issue, but I
think it would be easier to improve that in that design.

My plan was to parallelize the parsing roughly like this:

1) split the input buffer into smaller chunks

2) let workers scan the buffers and record positions of interesting
characters (delimiters, quotes, ...) and pass it back to the leader

3) use the information to actually parse the input data (we only need to
look at the interesting characters, skipping large parts of data)

4) pass the parsed chunks to workers, just like in the current patch


But maybe something like that would be possible even with the approach
you propose - we could have a special parse phase for processing each
buffer, where any worker could look for the special characters, record
the positions in a bitmap next to the buffer. So the whole sequence of
states would look something like this:

  EMPTY
  FILLED
  PARSED
  READY
  PROCESSING


I think it's even simpler than that. You don't need to communicate the 
"interesting positions" between processes, if the same worker takes care 
of the chunk through all states from FILLED to DONE.


You can build the bitmap of interesting positions immediately in FILLED 
state, independently of all previous blocks. Once you've built the 
bitmap, you need to wait for the information on where the first line 
starts, but presumably finding the interesting positions is the 
expensive part.



Of course, the question is whether parsing really is sufficiently
expensive for this to be worth it.


Yeah, I don't think it's worth it. Splitting the lines is pretty fast, I 
think we have many years to come before that becomes a bottleneck. But 
if it turns out I'm wrong and we need to implement that, the path is 
pretty straightforward.


- Heikki




Re: Parallel copy

2020-10-30 Thread Tomas Vondra

On Fri, Oct 30, 2020 at 06:41:41PM +0200, Heikki Linnakangas wrote:

On 30/10/2020 18:36, Heikki Linnakangas wrote:

I find this design to be very complicated. Why does the line-boundary
information need to be in shared memory? I think this would be much
simpler if each worker grabbed a fixed-size block of raw data, and
processed that.

In your patch, the leader process scans the input to find out where one
line ends and another begins, and because of that decision, the leader
needs to make the line boundaries available in shared memory, for the
worker processes. If we moved that responsibility to the worker
processes, you wouldn't need to keep the line boundaries in shared
memory. A worker would only need to pass enough state to the next worker
to tell it where to start scanning the next block.


Here's a high-level sketch of how I'm imagining this to work:

The shared memory structure consists of a queue of blocks, arranged as 
a ring buffer. Each block is of fixed size, and contains 64 kB of 
data, and a few fields for coordination:


typedef struct
{
   /* Current state of the block */
   pg_atomic_uint32 state;

   /* starting offset of first line within the block */
   int startpos;

   char        data[64 kB];
} ParallelCopyDataBlock;

Where state is one of:

enum {
 FREE,   /* buffer is empty */
 FILLED, /* leader has filled the buffer with raw data */
 READY,  /* start pos has been filled in, but no worker process 
has claimed the block yet */

 PROCESSING, /* worker has claimed the block, and is processing it */
}

State changes FREE -> FILLED -> READY -> PROCESSING -> FREE. As the 
COPY progresses, the ring of blocks will always look something like 
this:


blk 0 startpos  0: PROCESSING [worker 1]
blk 1 startpos 12: PROCESSING [worker 2]
blk 2 startpos 10: READY
blk 3 startpos  -: FILLED
blk 4 startpos  -: FILLED
blk 5 startpos  -: FILLED
blk 6 startpos  -: FREE
blk 7 startpos  -: FREE

Typically, each worker process is busy processing a block. After the 
blocks being processed, there is one block in READY state, and after 
that, blocks in FILLED state.


Leader process:

The leader process is simple. It picks the next FREE buffer, fills it 
with raw data from the file, and marks it as FILLED. If no buffers are 
FREE, wait.


Worker process:

1. Claim next READY block from queue, by changing its state to
  PROCESSING. If the next block is not READY yet, wait until it is.

2. Start scanning the block from 'startpos', finding end-of-line
  markers. (in CSV mode, need to track when we're in-quotes).

3. When you reach the end of the block, if the last line continues to
  next block, wait for the next block to become FILLED. Peek into the
  next block, and copy the remaining part of the split line to a local
  buffer, and set the 'startpos' on the next block to point to the end
  of the split line. Mark the next block as READY.

4. Process all the lines in the block, call input functions, insert
  rows.

5. Mark the block as DONE.

In this design, you don't need to keep line boundaries in shared 
memory, because each worker process is responsible for finding the 
line boundaries of its own block.


There's a point of serialization here, in that the next block cannot 
be processed, until the worker working on the previous block has 
finished scanning the EOLs, and set the starting position on the next 
block, putting it in READY state. That's not very different from your 
patch, where you had a similar point of serialization because the 
leader scanned the EOLs, but I think the coordination between 
processes is simpler here.




I agree this design looks simpler. I'm a bit worried about serializing
the parsing like this, though. It's true the current approach (where the
first phase of parsing happens in the leader) has a similar issue, but I
think it would be easier to improve that in that design.

My plan was to parallelize the parsing roughly like this:

1) split the input buffer into smaller chunks

2) let workers scan the buffers and record positions of interesting
characters (delimiters, quotes, ...) and pass it back to the leader

3) use the information to actually parse the input data (we only need to
look at the interesting characters, skipping large parts of data)

4) pass the parsed chunks to workers, just like in the current patch


But maybe something like that would be possible even with the approach
you propose - we could have a special parse phase for processing each
buffer, where any worker could look for the special characters, record
the positions in a bitmap next to the buffer. So the whole sequence of
states would look something like this:

EMPTY
FILLED
PARSED
READY
PROCESSING

Of course, the question is whether parsing really is sufficiently
expensive for this to be worth it.
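For concreteness, the scanning pass of step 2 could look roughly like 
this, assuming 64 kB buffers and one bit per input byte (names invented; 
which characters count as "interesting" depends on the COPY format):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE (64 * 1024)

typedef struct
{
	uint64_t	bits[BUF_SIZE / 64];	/* bit set => parser must look here */
} PosBitmap;

/*
 * Mark every byte the parser will have to inspect: delimiters, quotes,
 * escapes and line ends.  The parse phase can then skip runs of zero
 * bits without examining the data again.
 */
static void
mark_interesting(const char *buf, size_t len, char delim, char quote,
				 char escape, PosBitmap *bm)
{
	size_t		i;

	memset(bm, 0, sizeof(*bm));
	for (i = 0; i < len; i++)
	{
		char		c = buf[i];

		if (c == delim || c == quote || c == escape || c == '\n' || c == '\r')
			bm->bits[i / 64] |= UINT64_C(1) << (i % 64);
	}
}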


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Parallel copy

2020-10-30 Thread Tomas Vondra

Hi,

I've done a bit more testing today, and I think the parsing is busted in
some way. Consider this:

test=# create extension random;
CREATE EXTENSION

test=# create table t (a text);

CREATE TABLE

test=# insert into t select random_string(random_int(10, 256*1024)) from generate_series(1,1);

INSERT 0 1

test=# copy t to '/mnt/data/t.csv';

COPY 1

test=# truncate t;

TRUNCATE TABLE

test=# copy t from '/mnt/data/t.csv';

COPY 1

test=# truncate t;

TRUNCATE TABLE

test=# copy t from '/mnt/data/t.csv' with (parallel 2);

ERROR:  invalid byte sequence for encoding "UTF8": 0x00
CONTEXT:  COPY t, line 485: 
"m&\nh%_a"%r]>qtCl:Q5ltvF~;2oS6@HB>F>og,bD$Lw'nZY\tYl#BH\t{(j~ryoZ08"SGU~.}8CcTRk1\ts$@U3szCC+U1U3i@P..."
parallel worker


The functions come from an extension I use to generate random data, I've
pushed it to github [1]. The random_string() generates a random string
with ASCII characters, symbols and a couple special characters (\r\n\t).
The intent was to try loading data where a fields may span multiple 64kB
blocks and may contain newlines etc.

The non-parallel copy works fine, the parallel one fails. I haven't
investigated the details, but I guess it gets confused about where a
string starts/end, or something like that.


[1] https://github.com/tvondra/random


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Parallel copy

2020-10-30 Thread Heikki Linnakangas

On 30/10/2020 18:36, Heikki Linnakangas wrote:

Whether the leader process finds the EOLs or the worker processes, it's
pretty clear that it needs to be done ASAP, for a chunk at a time,
because that cannot be done in parallel. I think some refactoring in
CopyReadLine() and friends would be in order. It probably would be
faster, or at least not slower, to find all the EOLs in a block in one
tight loop, even when parallel copy is not used.


Something like the attached. It passes the regression tests, but it's 
quite incomplete. It's missing handling of "\." as the end-of-file marker, 
and I haven't tested encoding conversions at all, for starters. Quick 
testing suggests that this is a little bit faster than the current code, 
but the difference is small; I had to use a "WHERE false" option to 
really see the difference.


The crucial thing here is that there's a new function, ParseLinesText(), 
to find all end-of-line characters in a buffer in one go. In this patch, 
it's used against 'raw_buf', but with parallel copy, you could point it 
at a block in shared memory instead.
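To give an idea of the shape of that loop (a simplification, not the 
attached patch's actual code), the scan can be a single pass over the 
buffer driven by the ParseLinesState states introduced below, assuming 
'\n' terminators, '"' quoting and backslash escapes:

static int
find_line_ends(const char *buf, int len, int *endlines, int max_ends,
			   ParseLinesState *state)
{
	int			nlines = 0;
	int			i;

	for (i = 0; i < len && nlines < max_ends; i++)
	{
		char		c = buf[i];

		switch (*state)
		{
			case PLSTATE_NORMAL:
				if (c == '\n')
					endlines[nlines++] = i + 1;	/* offset just past EOL */
				else if (c == '"')
					*state = PLSTATE_IN_QUOTE;
				else if (c == '\\')
					*state = PLSTATE_ESCAPE;
				break;
			case PLSTATE_ESCAPE:
				*state = PLSTATE_NORMAL;	/* consume one escaped byte */
				break;
			case PLSTATE_IN_QUOTE:
				if (c == '"')
					*state = PLSTATE_NORMAL;
				else if (c == '\\')
					*state = PLSTATE_ESCAPE_IN_QUOTE;
				break;
			case PLSTATE_ESCAPE_IN_QUOTE:
				*state = PLSTATE_IN_QUOTE;
				break;
		}
	}
	return nlines;
}

Since the state is carried across calls, the same loop keeps working when 
a line straddles two raw buffers.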


- Heikki
>From af3be3bd4e77b66f4605393617da0d15ec21e15b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Fri, 30 Oct 2020 18:51:10 +0200
Subject: [PATCH 1/1] WIP: Find all line-endings in COPY in chunks.

Refactor CopyReadLines and friends to find all the line-endings in the
buffer in one go, before splitting the lines further.
---
 src/backend/commands/copy.c | 972 
 1 file changed, 536 insertions(+), 436 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 36ddcdccdb8..fbf11cb2550 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -95,6 +95,18 @@ typedef enum CopyInsertMethod
 	CIM_MULTI_CONDITIONAL		/* use table_multi_insert only if valid */
 } CopyInsertMethod;
 
+
+/*
+ * Represents the state of the line-ending scanner used during COPY FROM.
+ */
+typedef enum ParseLinesState
+{
+	PLSTATE_NORMAL,
+	PLSTATE_ESCAPE,
+	PLSTATE_IN_QUOTE,
+	PLSTATE_ESCAPE_IN_QUOTE
+} ParseLinesState;
+
 /*
  * This struct contains all the state variables used throughout a COPY
  * operation. For simplicity, we use the same struct for all variants of COPY,
@@ -110,6 +122,24 @@ typedef enum CopyInsertMethod
  * it's faster to make useless comparisons to trailing bytes than it is to
  * invoke pg_encoding_mblen() to skip over them. encoding_embeds_ascii is true
  * when we have to do it the hard way.
+ *
+ * COPY FROM buffers:
+ *
+ * In COPY FROM processing, there are three levels of buffers:
+ *
+ * raw_buf   - contains raw data read from file/client
+ * converted_buf - contains the data in 'raw_buf', but converted to server encoding
+ * line_buf  - contains "current" line of data, without the end-of-line char
+ *
+ *
+ * In simple cases, no encoding conversion is needed, and converted_buf always
+ * points to raw_buf. If encoding_embeds_ascii is true, encoding conversion is
+ * performed on the raw buffer, before splitting it into lines. converted_buf
+ * contains the converted version in that case.
+ *
+ * Usually, the line_buf pointer points into the middle of converted_buf, but when a line
+ * is split by a raw-buffer boundary, the incomplete line is reassembled
+ * in a separate buffer (split_line_buf), and line_buf points to that.
  */
 typedef struct CopyStateData
 {
@@ -205,16 +235,34 @@ typedef struct CopyStateData
 	char	  **raw_fields;
 
 	/*
-	 * Similarly, line_buf holds the whole input line being processed. The
+	 * These variables are used to track state of parsing raw data into
+	 * lines in COPY FROM.
+	 */
+	bool		last_was_cr;
+	ParseLinesState parse_lines_state;
+
+	int			last_line_no; /* last line in 'endlines', -1 if EOF not reached yet */
+
+	int			nextline;
+	int		   *endlines; /* line ending positions within raw_buf */
+	int			numlines;
+
+	/* split_line_buf holds partial line carried over from previous buf */
+	StringInfoData split_line_buf;
+
+	/*
+	 * Similarly, line_buf holds the current input line being processed. The
 	 * input cycle is first to read the whole line into line_buf, convert it
 	 * to server encoding there, and then extract the individual attribute
 	 * fields into attribute_buf.  line_buf is preserved unmodified so that we
 	 * can display it in error messages if appropriate.  (In binary mode,
 	 * line_buf is not used.)
 	 */
-	StringInfoData line_buf;
+	char	   *line_buf;
+	int			line_len;
 	bool		line_buf_converted; /* converted to server encoding? */
 	bool		line_buf_valid; /* contains the row being processed? */
+	bool		line_buf_alloced;
 
 	/*
 	 * Finally, raw_buf holds raw data read from the data source (file or
@@ -230,6 +278,9 @@ typedef struct CopyStateData
 	int			raw_buf_len;	/* total # of bytes stored */
 	/* Shorthand for number of unconsumed bytes available in raw_buf */
#define RAW_BUF_BYTES(cstate) ((cstate)->raw_buf_len - (cstate)->raw_buf_index)

Re: Parallel copy

2020-10-30 Thread Heikki Linnakangas

On 30/10/2020 18:36, Heikki Linnakangas wrote:

I find this design to be very complicated. Why does the line-boundary
information need to be in shared memory? I think this would be much
simpler if each worker grabbed a fixed-size block of raw data, and
processed that.

In your patch, the leader process scans the input to find out where one
line ends and another begins, and because of that decision, the leader
needs to make the line boundaries available in shared memory, for the
worker processes. If we moved that responsibility to the worker
processes, you wouldn't need to keep the line boundaries in shared
memory. A worker would only need to pass enough state to the next worker
to tell it where to start scanning the next block.


Here's a high-level sketch of how I'm imagining this to work:

The shared memory structure consists of a queue of blocks, arranged as a 
ring buffer. Each block is of fixed size, and contains 64 kB of data, 
and a few fields for coordination:


typedef struct
{
/* Current state of the block */
pg_atomic_uint32 state;

/* starting offset of first line within the block */
int startpos;

    char        data[64 kB];
} ParallelCopyDataBlock;

Where state is one of:

enum {
  FREE,   /* buffer is empty */
  FILLED, /* leader has filled the buffer with raw data */
  READY,  /* start pos has been filled in, but no worker process 
has claimed the block yet */

  PROCESSING, /* worker has claimed the block, and is processing it */
}

State changes FREE -> FILLED -> READY -> PROCESSING -> FREE. As the COPY 
progresses, the ring of blocks will always look something like this:


blk 0 startpos  0: PROCESSING [worker 1]
blk 1 startpos 12: PROCESSING [worker 2]
blk 2 startpos 10: READY
blk 3 startpos  -: FILLED
blk 4 startpos  -: FILLED
blk 5 startpos  -: FILLED
blk 6 startpos  -: FREE
blk 7 startpos  -: FREE

Typically, each worker process is busy processing a block. After the 
blocks being processed, there is one block in READY state, and after 
that, blocks in FILLED state.
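If more than one worker can poll the same slot, the READY -> PROCESSING 
transition is a natural fit for compare-and-swap; a minimal sketch using 
PostgreSQL's atomics API and the struct above:

#include "postgres.h"
#include "port/atomics.h"

/* Returns true if this worker won the claim on 'blk'. */
static bool
try_claim_block(ParallelCopyDataBlock *blk)
{
	uint32		expected = READY;

	return pg_atomic_compare_exchange_u32(&blk->state, &expected, PROCESSING);
}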


Leader process:

The leader process is simple. It picks the next FREE buffer, fills it 
with raw data from the file, and marks it as FILLED. If no buffers are 
FREE, wait.


Worker process:

1. Claim next READY block from queue, by changing its state to
   PROCESSING. If the next block is not READY yet, wait until it is.

2. Start scanning the block from 'startpos', finding end-of-line
   markers. (in CSV mode, need to track when we're in-quotes).

3. When you reach the end of the block, if the last line continues to
   next block, wait for the next block to become FILLED. Peek into the
   next block, and copy the remaining part of the split line to a local
   buffer, and set the 'startpos' on the next block to point to the end
   of the split line. Mark the next block as READY.

4. Process all the lines in the block, call input functions, insert
   rows.

5. Mark the block as DONE.

In this design, you don't need to keep line boundaries in shared memory, 
because each worker process is responsible for finding the line 
boundaries of its own block.


There's a point of serialization here, in that the next block cannot be 
processed, until the worker working on the previous block has finished 
scanning the EOLs, and set the starting position on the next block, 
putting it in READY state. That's not very different from your patch, 
where you had a similar point of serialization because the leader 
scanned the EOLs, but I think the coordination between processes is 
simpler here.
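As a rough illustration of the hand-off in step 3, with the waiting shown 
as a busy loop for brevity (real code would sleep on a latch, and would 
need a memory barrier before publishing READY; names are illustrative):

#include "postgres.h"
#include "lib/stringinfo.h"
#include "port/atomics.h"

static void
hand_over_boundary(ParallelCopyDataBlock *next, int split_ofs,
				   StringInfo split_line)
{
	/* wait until the leader has filled the next block */
	while (pg_atomic_read_u32(&next->state) != FILLED)
		;

	/* 'split_ofs' is where the spilled line was found to end in 'next' */
	appendBinaryStringInfo(split_line, next->data, split_ofs);

	/* publish where the first complete line of the next block starts */
	next->startpos = split_ofs;
	pg_atomic_write_u32(&next->state, READY);
}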


- Heikki




Re: Parallel copy

2020-10-30 Thread Heikki Linnakangas

On 27/10/2020 15:36, vignesh C wrote:

Attached v9 patches have the fixes for the above comments.


I find this design to be very complicated. Why does the line-boundary 
information need to be in shared memory? I think this would be much 
simpler if each worker grabbed a fixed-size block of raw data, and 
processed that.


In your patch, the leader process scans the input to find out where one 
line ends and another begins, and because of that decision, the leader 
needs to make the line boundaries available in shared memory, for the 
worker processes. If we moved that responsibility to the worker 
processes, you wouldn't need to keep the line boundaries in shared 
memory. A worker would only need to pass enough state to the next worker 
to tell it where to start scanning the next block.


Whether the leader process finds the EOLs or the worker processes, it's 
pretty clear that it needs to be done ASAP, for a chunk at a time, 
because that cannot be done in parallel. I think some refactoring in 
CopyReadLine() and friends would be in order. It probably would be 
faster, or at least not slower, to find all the EOLs in a block in one 
tight loop, even when parallel copy is not used.


- Heikki




Re: Parallel copy

2020-10-29 Thread Amit Kapila
On Thu, Oct 29, 2020 at 11:45 AM Amit Kapila  wrote:
>
> On Tue, Oct 27, 2020 at 7:06 PM vignesh C  wrote:
> >
> [latest version]
>
> I think the parallel-safety checks in this patch
> (v9-0002-Allow-copy-from-command-to-process-data-from-file) are
> incomplete and wrong.
>

One more point: I noticed that some time back [1] I had given a
suggestion related to the way workers process the set of lines (aka
chunk). I think you can try increasing the chunk size to, say, 100, 500,
or 1000, and use some shared counter to remember the number of chunks
processed.
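The shared counter itself can be a single atomic in the shared state,
e.g. (field name invented):

#include "port/atomics.h"

/* Each call hands out the next chunk number with one atomic add. */
static inline uint64
claim_next_chunk(pg_atomic_uint64 *nchunks_claimed)
{
	return pg_atomic_fetch_add_u64(nchunks_claimed, 1);
}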

[1] - 
https://www.postgresql.org/message-id/CAA4eK1L-Xgw1zZEbGePmhBBWmEmLFL6rCaiOMDPnq2GNMVz-sg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-29 Thread Daniel Westermann (DWE)
On 27/10/2020 15:36, vignesh C wrote:
>> Attached v9 patches have the fixes for the above comments.

>I did some testing:

I did some testing as well and have a cosmetic remark:

postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 10);
ERROR:  value 10 out of bounds for option "parallel"
DETAIL:  Valid values are between "1" and "1024".
postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 1000);
ERROR:  parallel requires an integer value
postgres=# 

Wouldn't it make more sense to only have one error message? The first one seems 
to be the better message.

Regards
Daniel



Re: Parallel copy

2020-10-29 Thread Heikki Linnakangas

On 27/10/2020 15:36, vignesh C wrote:

Attached v9 patches have the fixes for the above comments.


I did some testing:

/tmp/longdata.pl:

#!/usr/bin/perl
#
# Generate three rows:
# foo
# longdatalongdatalongdata...
# bar
#
# The length of the middle row is given as command line arg.
#

my $bytes = $ARGV[0];

print "foo\n";
for (my $i = 0; $i < $bytes; $i += 8) {
    print "longdata";
}
print "\n";
print "bar\n";


postgres=# copy longdata from program 'perl /tmp/longdata.pl 1' 
with (parallel 2);


This gets stuck forever (or at least I didn't have the patience to wait 
it finish). Both worker processes are consuming 100% of CPU.


- Heikki




Re: Parallel copy

2020-10-29 Thread Amit Kapila
On Tue, Oct 27, 2020 at 7:06 PM vignesh C  wrote:
>
[latest version]

I think the parallel-safety checks in this patch
(v9-0002-Allow-copy-from-command-to-process-data-from-file) are
incomplete and wrong. See below comments.
1.
+static pg_attribute_always_inline bool
+CheckExprParallelSafety(CopyState cstate)
+{
+ if (contain_volatile_functions(cstate->whereClause))
+ {
+ if (max_parallel_hazard((Query *) cstate->whereClause) != PROPARALLEL_SAFE)
+ return false;
+ }

I don't understand the above check. Why do we only need to check the
where clause for parallel-safety when it contains volatile functions? It
should be checked otherwise as well, no? A similar comment applies
to other checks in this function. Also, I don't think there is a need
to make this function inline.
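In other words, something along these lines (a sketch only, mirroring the
types used in the quoted patch):

static bool
CheckExprParallelSafety(CopyState cstate)
{
	/* check the WHERE clause unconditionally, not only when volatile */
	if (cstate->whereClause != NULL &&
		max_parallel_hazard((Query *) cstate->whereClause) != PROPARALLEL_SAFE)
		return false;

	/* ... and apply the same unconditional treatment to the other checks ... */
	return true;
}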

2.
+/*
+ * IsParallelCopyAllowed
+ *
+ * Check if parallel copy can be allowed.
+ */
+bool
+IsParallelCopyAllowed(CopyState cstate)
{
..
+ * When there are BEFORE/AFTER/INSTEAD OF row triggers on the table. We do
+ * not allow parallelism in such cases because such triggers might query
+ * the table we are inserting into and act differently if the tuples that
+ * have already been processed and prepared for insertion are not there.
+ * Now, if we allow parallelism with such triggers the behaviour would
+ * depend on if the parallel worker has already inserted or not that
+ * particular tuples.
+ */
+ if (cstate->rel->trigdesc != NULL &&
+ (cstate->rel->trigdesc->trig_insert_after_statement ||
+ cstate->rel->trigdesc->trig_insert_new_table ||
+ cstate->rel->trigdesc->trig_insert_before_row ||
+ cstate->rel->trigdesc->trig_insert_after_row ||
+ cstate->rel->trigdesc->trig_insert_instead_row))
+ return false;
..

Why do we need to disable parallelism for before/after row triggers
unless they have parallel-unsafe functions? I see that a few lines down in
this function you are checking the parallel-safety of trigger functions;
what is the use of that if you are already disabling parallelism
with the above check?

3. What if an index on the table has expressions that are
parallel-unsafe? What is your strategy to check parallel-safety for
partitioned tables?

I suggest checking Greg's patch for parallel-safety of Inserts [1]. I
think you will find that most of those checks are required here as
well and see how we can use that patch (at least what is common). I
feel the first patch should be just to have parallel-safety checks and
we can test that by trying to enable Copy with force_parallel_mode. We
can build the rest of the patch atop of it or in other words, let's
move all parallel-safety work into a separate patch.

Few assorted comments:

1.
+/*
+ * ESTIMATE_NODE_SIZE - Estimate the size required for a node type in shared
+ * memory.
+ */
+#define ESTIMATE_NODE_SIZE(list, listStr, strsize) \
+{ \
+ uint32 estsize = sizeof(uint32); \
+ if ((List *)list != NIL) \
+ { \
+ listStr = nodeToString(list); \
+ estsize += strlen(listStr) + 1; \
+ } \
+ \
+ strsize = add_size(strsize, estsize); \
+}

This can be probably a function instead of a macro.

2.
+/*
+ * ESTIMATE_1BYTE_STR_SIZE - Estimate the size required for 1Byte strings in
+ * shared memory.
+ */
+#define ESTIMATE_1BYTE_STR_SIZE(src, strsize) \
+{ \
+ strsize = add_size(strsize, sizeof(uint8)); \
+ strsize = add_size(strsize, (src) ? 1 : 0); \
+}

This could be an inline function.

3.
+/*
+ * SERIALIZE_1BYTE_STR - Copy 1Byte strings to shared memory.
+ */
+#define SERIALIZE_1BYTE_STR(dest, src, copiedsize) \
+{ \
+ uint8 len = (src) ? 1 : 0; \
+ memcpy(dest + copiedsize, (uint8 *) &len, sizeof(uint8)); \
+ copiedsize += sizeof(uint8); \
+ if (src) \
+ dest[copiedsize++] = src[0]; \
+}

Similarly, this could be a function. I think keeping such things as
macros in between the code makes it difficult to read. Please see if you
can make these and similar macros into functions, unless they are doing
only a few memory instructions. Having functions makes it easier to debug the
code as well.
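For example, the 1-byte-string estimator above could become (sketch):

#include "storage/shmem.h"	/* for add_size() */

static inline void
estimate_1byte_str_size(const char *src, Size *strsize)
{
	*strsize = add_size(*strsize, sizeof(uint8));
	*strsize = add_size(*strsize, src ? 1 : 0);
}

which is type-checked and steppable in a debugger, unlike the macro.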

[1] - 
https://www.postgresql.org/message-id/CAJcOf-cgfjj0NfYPrNFGmQJxsnNW102LTXbzqxQJuziar1EKfQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




RE: Parallel copy

2020-10-28 Thread Hou, Zhijie
Hi 

I found some issue in v9-0002

1.
+
+   elog(DEBUG1, "[Worker] Processing - line position:%d, block:%d, unprocessed lines:%d, offset:%d, line size:%d",
+        write_pos, lineInfo->first_block,
+        pg_atomic_read_u32(&data_blk_ptr->unprocessed_line_parts),
+        offset, pg_atomic_read_u32(&lineInfo->line_size));
+

write_pos and the other variables printed here are of type uint32; I think 
it's better to use '%u' in the elog message.
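That is, something like this (assuming the surrounding variables are as
in the patch):

	elog(DEBUG1, "[Worker] Processing - line position:%u, block:%u, unprocessed lines:%u, offset:%u, line size:%u",
		 write_pos, lineInfo->first_block,
		 pg_atomic_read_u32(&data_blk_ptr->unprocessed_line_parts),
		 offset, pg_atomic_read_u32(&lineInfo->line_size));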

2.
+* line_size will be set. Read the line_size again to be sure 
if it is
+* completed or partial block.
+*/
+   dataSize = pg_atomic_read_u32(&lineInfo->line_size);
+   if (dataSize)

It uses dataSize (type int) to receive a uint32 value, which seems a little
dangerous. Would it be better to define dataSize as uint32 here?

3.
Since functions with 'Cstate' in the name have been changed to 'CState',
I think we can rename PopulateCommonCstateInfo as well.

4.
+   if (pcdata->worker_line_buf_count)

I think some check like the above can be 'if (xxx > 0)', which seems easier to 
understand.


Best regards,
houzj




Re: Parallel copy

2020-10-27 Thread vignesh C
On Fri, Oct 23, 2020 at 6:58 PM Ashutosh Sharma  wrote:
>
> >
> > I think, if possible, all these if-else checks in CopyFrom() can be
> > moved to a single function which can probably be named as
> > IdentifyCopyInsertMethod() and this function can be called in
> > IsParallelCopyAllowed(). This will ensure that in case of Parallel
> > Copy when the leader has performed all these checks, the worker won't
> > do it again. I also feel that it will make the code look a bit
> > cleaner.
> >
>
> Just rewriting above comment to make it a bit more clear:
>
> I think, if possible, all these if-else checks in CopyFrom() should be
> moved to a separate function which can probably be named as
> IdentifyCopyInsertMethod() and this function called from
> IsParallelCopyAllowed() and CopyFrom() functions. It will only be
> called from CopyFrom() when IsParallelCopy() returns false. This will
> ensure that in case of Parallel Copy if the leader has performed all
> these checks, the worker won't do it again. I also feel that having a
> separate function containing all these checks will make the code look
> a bit cleaner.
>

In the recently posted patch we have changed it to simplify the check
for parallel copy, so it is not an exact match. I feel this comment is
not applicable to the latest patch.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-27 Thread vignesh C
Thanks Ashutosh for reviewing and providing your comments.

On Fri, Oct 23, 2020 at 5:43 PM Ashutosh Sharma  wrote:
>
> Hi Vignesh,
>
> Thanks for the updated patches. Here are some more comments that I can
> find after reviewing your latest patches:
>
> +/*
> + * This structure helps in storing the common data from CopyStateData that 
> are
> + * required by the workers. This information will then be allocated and 
> stored
> + * into the DSM for the worker to retrieve and copy it to CopyStateData.
> + */
> +typedef struct SerializedParallelCopyState
> +{
> +   /* low-level state data */
> +   CopyDestcopy_dest;  /* type of copy source/destination */
> +   int file_encoding;  /* file or remote side's character encoding */
> +   boolneed_transcoding;   /* file encoding diff from server? */
> +   boolencoding_embeds_ascii;  /* ASCII can be non-first byte? */
> +
> ...
> ...
> +
> +   /* Working state for COPY FROM */
> +   AttrNumber  num_defaults;
> +   Oid relid;
> +} SerializedParallelCopyState;
>
> Can the above structure not be part of the CopyStateData structure? I
> am just asking this question because all the fields present in the
> above structure are also present in the CopyStateData structure. So,
> including it in the CopyStateData structure will reduce the code
> duplication and will also make CopyStateData a bit shorter.
>

I have removed the common members from the structure, now there are no
common members between CopyStateData & the new structure. I'm using
CopyStateData to copy to/from directly in the new patch.

> --
>
> +   pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist,
> +relid);
>
> Do we need to pass cstate->nworkers and relid to BeginParallelCopy()
> function when we are already passing cstate structure, using which
> both of these information can be retrieved ?
>

nworkers need not be passed as you have suggested, but relid needs to
be passed as we will be setting it in pcdata; modified nworkers as
suggested.

> --
>
> +/* DSM keys for parallel copy.  */
> +#define PARALLEL_COPY_KEY_SHARED_INFO  1
> +#define PARALLEL_COPY_KEY_CSTATE   2
> +#define PARALLEL_COPY_WAL_USAGE3
> +#define PARALLEL_COPY_BUFFER_USAGE 4
>
> DSM key names do not appear to be consistent. For shared info and
> cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY",
> but for WalUsage and BufferUsage structures, it is prefixed with
> "PARALLEL_COPY". I think it would be better to make them consistent.
>

Modified as suggested

> --
>
> if (resultRelInfo->ri_TrigDesc != NULL &&
> (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
>  resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
> {
> /*
>  * Can't support multi-inserts when there are any BEFORE/INSTEAD OF
>  * triggers on the table. Such triggers might query the table we're
>  * inserting into and act differently if the tuples that have already
>  * been processed and prepared for insertion are not there.
>  */
> insertMethod = CIM_SINGLE;
> }
> else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
>  resultRelInfo->ri_TrigDesc->trig_insert_new_table)
> {
> /*
>  * For partitioned tables we can't support multi-inserts when there
>  * are any statement level insert triggers. It might be possible to
>  * allow partitioned tables with such triggers in the future, but for
>  * now, CopyMultiInsertInfoFlush expects that any before row insert
>  * and statement level insert triggers are on the same relation.
>  */
> insertMethod = CIM_SINGLE;
> }
> else if (resultRelInfo->ri_FdwRoutine != NULL ||
>  cstate->volatile_defexprs)
> {
> ...
> ...
>
> I think, if possible, all these if-else checks in CopyFrom() can be
> moved to a single function which can probably be named as
> IdentifyCopyInsertMethod() and this function can be called in
> IsParallelCopyAllowed(). This will ensure that in case of Parallel
> Copy when the leader has performed all these checks, the worker won't
> do it again. I also feel that it will make the code look a bit
> cleaner.
>

In the recently posted patch we have changed it to simplify the check
for parallel copy, so it is not an exact match. I feel this comment is
not applicable to the latest patch.

> --
>
> +void
> +ParallelCopyMain(dsm_segment *seg, shm_toc *toc)
> +{

Re: Parallel copy

2020-10-27 Thread vignesh C
On Wed, Oct 21, 2020 at 3:50 PM Amit Kapila  wrote:
>
> On Wed, Oct 21, 2020 at 3:19 PM Bharath Rupireddy
>  wrote:
> >
> >
> > 9. Instead of calling CopyStringToSharedMemory() for each string
> > variable, can't we just create a linked list of all the strings that
> > need to be copied into shm and call CopyStringToSharedMemory() only
> > once? We could avoid 5 function calls?
> >
>
> If we want to avoid different function calls then can't we just store
> all these strings in a local structure and use it? That might improve
> the other parts of code as well where we are using these as individual
> parameters.
>

I have made one structure, SerializedListToStrCState, to store all the
variables. The rest of the common variables are directly copied from &
into cstate.

> > 10. Similar to above comment: can we fill all the required
> > cstate->variables inside the function CopyNodeFromSharedMemory() and
> > call it only once? In each worker we could save overhead of 5 function
> > calls.
> >
>
> Yeah, that makes sense.
>

I feel keeping it this way makes the code more readable, and also this
is not in a performance-intensive tight loop. I'm retaining the
change as is unless we feel this will make an impact.

This is addressed in v9 patch shared at [1].
[1] - 
https://www.postgresql.org/message-id/caldanm1caonkfdn6k72dsirpgqngvwxql7tjeihz58opnp9...@mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-27 Thread vignesh C
Thanks Heikki for reviewing and providing your comments. Please find
my thoughts below.

On Fri, Oct 23, 2020 at 2:01 PM Heikki Linnakangas  wrote:
>
> I had a brief look at at this patch. Important work! A couple of first
> impressions:
>
> 1. The split between patches
> 0002-Framework-for-leader-worker-in-parallel-copy.patch and
> 0003-Allow-copy-from-command-to-process-data-from-file.patch is quite
> artificial. All the stuff introduced in the first is unused until the
> second patch is applied. The first patch introduces a forward
> declaration for ParallelCopyData(), but the function only comes in the
> second patch. The comments in the first patch talk about
> LINE_LEADER_POPULATING and LINE_LEADER_POPULATED, but the enum only
> comes in the second patch. I think these have to merged into one. If you
> want to split it somehow, I'd suggest having a separate patch just to
> move CopyStateData from copy.c to copy.h. The subsequent patch would
> then be easier to read as you could see more easily what's being added
> to CopyStateData. Actually I think it would be better to have a new
> header file, copy_internal.h, to hold CopyStateData and the other
> structs, and keep copy.h as it is.
>

I have merged the 0002 & 0003 patches, and I have moved a few things,
like the creation of copy_internal.h and the move of CopyStateData from
copy.c into copy_internal.h, into the 0001 patch.

> 2. This desperately needs some kind of a high-level overview of how it
> works. What is a leader, what is a worker? Which process does each step
> of COPY processing, like reading from the file/socket, splitting the
> input into lines, handling escapes, calling input functions, and
> updating the heap and indexes? What data structures are used for the
> communication? How is the work synchronized between the processes?
> There are comments on those individual aspects scattered in the patch,
> but if you're not already familiar with it, you don't know where to
> start. There's some of that in the commit message, but it needs to be
> somewhere in the source code, maybe in a long comment at the top of
> copyparallel.c.
>

Added it in copyparallel.c

> 3. I'm surprised there's a separate ParallelCopyLineBoundary struct for
> every input line. Doesn't that incur a lot of synchronization overhead?
> I haven't done any testing, this is just my gut feeling, but I assumed
> you'd work in batches of, say, 100 or 1000 lines each.
>

Data read from the file will be stored in the DSM, which is of size 64K *
1024. The leader will parse the input and identify the line boundaries:
which data block a line starts in, the starting offset within that data
block, and the line size; this information is kept in
ParallelCopyLineBoundary (sketched below). Like you said, each worker
processes a batch of lines, WORKER_CHUNK_COUNT (64) lines at a time.
Performance test results run for parallel copy are available at [1].
This is addressed in the v9 patch shared at [2].

[1] 
https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/caldanm1caonkfdn6k72dsirpgqngvwxql7tjeihz58opnp9...@mail.gmail.com
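For readers following along, the per-line bookkeeping described above has
roughly this shape (the field names appear in review snippets elsewhere
in this thread; the other details are guesses):

typedef struct ParallelCopyLineBoundary
{
	uint32		first_block;	/* data block in which the line starts */
	uint32		start_offset;	/* offset of the line within that block */
	pg_atomic_uint32 line_size;	/* set once the whole line has been read */
} ParallelCopyLineBoundary;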

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-27 Thread vignesh C
On Wed, Oct 21, 2020 at 4:20 PM Bharath Rupireddy
 wrote:
>
> On Wed, Oct 21, 2020 at 3:18 PM Bharath Rupireddy
>  wrote:
> >
> > 17. Remove extra lines after #define IsHeaderLine()
> > (cstate->header_line && cstate->cur_lineno == 1) in copy.h
> >
>
>  I missed one comment:
>
>  18. I think we need to treat the number of parallel workers as an
> integer similar to the parallel option in vacuum.
>
> postgres=# copy t1 from stdin with(parallel '1');  <-- we
> should not allow this.
> Enter data to be copied followed by a newline.
>
> postgres=# vacuum (parallel '1') t1;
> ERROR:  parallel requires an integer value
>

I have made the behavior the same as vacuum.
This is addressed in v9 patch shared at [1].
[1] - 
https://www.postgresql.org/message-id/caldanm1caonkfdn6k72dsirpgqngvwxql7tjeihz58opnp9...@mail.gmail.com
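For reference, vacuum-style handling of the option looks roughly like
this (a sketch; the bounds follow the error message reported earlier in
this thread):

#include "commands/defrem.h"	/* for defGetInt32() */

static void
parse_parallel_option(CopyState cstate, DefElem *defel)
{
	/* defGetInt32() itself raises "parallel requires an integer value" */
	int			nworkers = defGetInt32(defel);

	if (nworkers < 1 || nworkers > 1024)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("value %d out of bounds for option \"parallel\"",
						nworkers)));
	cstate->nworkers = nworkers;
}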

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-23 Thread Ashutosh Sharma
On Fri, Oct 23, 2020 at 5:42 PM Ashutosh Sharma  wrote:
>
> Hi Vignesh,
>
> Thanks for the updated patches. Here are some more comments that I can
> find after reviewing your latest patches:
>
> +/*
> + * This structure helps in storing the common data from CopyStateData that 
> are
> + * required by the workers. This information will then be allocated and 
> stored
> + * into the DSM for the worker to retrieve and copy it to CopyStateData.
> + */
> +typedef struct SerializedParallelCopyState
> +{
> +   /* low-level state data */
> +   CopyDestcopy_dest;  /* type of copy source/destination */
> +   int file_encoding;  /* file or remote side's character encoding */
> +   boolneed_transcoding;   /* file encoding diff from server? */
> +   boolencoding_embeds_ascii;  /* ASCII can be non-first byte? */
> +
> ...
> ...
> +
> +   /* Working state for COPY FROM */
> +   AttrNumber  num_defaults;
> +   Oid relid;
> +} SerializedParallelCopyState;
>
> Can the above structure not be part of the CopyStateData structure? I
> am just asking this question because all the fields present in the
> above structure are also present in the CopyStateData structure. So,
> including it in the CopyStateData structure will reduce the code
> duplication and will also make CopyStateData a bit shorter.
>
> --
>
> +   pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist,
> +relid);
>
> Do we need to pass cstate->nworkers and relid to BeginParallelCopy()
> function when we are already passing cstate structure, using which
> both of these information can be retrieved ?
>
> --
>
> +/* DSM keys for parallel copy.  */
> +#define PARALLEL_COPY_KEY_SHARED_INFO  1
> +#define PARALLEL_COPY_KEY_CSTATE   2
> +#define PARALLEL_COPY_WAL_USAGE3
> +#define PARALLEL_COPY_BUFFER_USAGE 4
>
> DSM key names do not appear to be consistent. For shared info and
> cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY",
> but for WalUsage and BufferUsage structures, it is prefixed with
> "PARALLEL_COPY". I think it would be better to make them consistent.
>
> --
>
> if (resultRelInfo->ri_TrigDesc != NULL &&
> (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
>  resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
> {
> /*
>  * Can't support multi-inserts when there are any BEFORE/INSTEAD OF
>  * triggers on the table. Such triggers might query the table we're
>  * inserting into and act differently if the tuples that have already
>  * been processed and prepared for insertion are not there.
>  */
> insertMethod = CIM_SINGLE;
> }
> else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
>  resultRelInfo->ri_TrigDesc->trig_insert_new_table)
> {
> /*
>  * For partitioned tables we can't support multi-inserts when there
>  * are any statement level insert triggers. It might be possible to
>  * allow partitioned tables with such triggers in the future, but for
>  * now, CopyMultiInsertInfoFlush expects that any before row insert
>  * and statement level insert triggers are on the same relation.
>  */
> insertMethod = CIM_SINGLE;
> }
> else if (resultRelInfo->ri_FdwRoutine != NULL ||
>  cstate->volatile_defexprs)
> {
> ...
> ...
>
> I think, if possible, all these if-else checks in CopyFrom() can be
> moved to a single function which can probably be named as
> IdentifyCopyInsertMethod() and this function can be called in
> IsParallelCopyAllowed(). This will ensure that in case of Parallel
> Copy when the leader has performed all these checks, the worker won't
> do it again. I also feel that it will make the code look a bit
> cleaner.
>

Just rewriting above comment to make it a bit more clear:

I think, if possible, all these if-else checks in CopyFrom() should be
moved to a separate function which can probably be named as
IdentifyCopyInsertMethod(), and this function can be called from
IsParallelCopyAllowed() and CopyFrom() functions. It will only be
called from CopyFrom() when IsParallelCopy() returns false. This will
ensure that in case of Parallel Copy if the leader has performed all
these checks, the worker won't do it again. I also feel that having a
separate function containing all these checks will make the code look
a bit cleaner.

> --
>
> +void
> +ParallelCopyMain(dsm_segment *seg, shm_toc *toc)

Re: Parallel copy

2020-10-23 Thread Ashutosh Sharma
Hi Vignesh,

Thanks for the updated patches. Here are some more comments that I can
find after reviewing your latest patches:

+/*
+ * This structure helps in storing the common data from CopyStateData that are
+ * required by the workers. This information will then be allocated and stored
+ * into the DSM for the worker to retrieve and copy it to CopyStateData.
+ */
+typedef struct SerializedParallelCopyState
+{
+   /* low-level state data */
+   CopyDestcopy_dest;  /* type of copy source/destination */
+   int file_encoding;  /* file or remote side's character encoding */
+   boolneed_transcoding;   /* file encoding diff from server? */
+   boolencoding_embeds_ascii;  /* ASCII can be non-first byte? */
+
...
...
+
+   /* Working state for COPY FROM */
+   AttrNumber  num_defaults;
+   Oid relid;
+} SerializedParallelCopyState;

Can the above structure not be part of the CopyStateData structure? I
am just asking this question because all the fields present in the
above structure are also present in the CopyStateData structure. So,
including it in the CopyStateData structure will reduce the code
duplication and will also make CopyStateData a bit shorter.

--

+   pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist,
+relid);

Do we need to pass cstate->nworkers and relid to BeginParallelCopy()
function when we are already passing cstate structure, using which
both of these information can be retrieved ?

--

+/* DSM keys for parallel copy.  */
+#define PARALLEL_COPY_KEY_SHARED_INFO  1
+#define PARALLEL_COPY_KEY_CSTATE   2
+#define PARALLEL_COPY_WAL_USAGE3
+#define PARALLEL_COPY_BUFFER_USAGE 4

DSM key names do not appear to be consistent. For shared info and
cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY",
but for WalUsage and BufferUsage structures, it is prefixed with
"PARALLEL_COPY". I think it would be better to make them consistent.

--

if (resultRelInfo->ri_TrigDesc != NULL &&
(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
 resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
{
/*
 * Can't support multi-inserts when there are any BEFORE/INSTEAD OF
 * triggers on the table. Such triggers might query the table we're
 * inserting into and act differently if the tuples that have already
 * been processed and prepared for insertion are not there.
 */
insertMethod = CIM_SINGLE;
}
else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
 resultRelInfo->ri_TrigDesc->trig_insert_new_table)
{
/*
 * For partitioned tables we can't support multi-inserts when there
 * are any statement level insert triggers. It might be possible to
 * allow partitioned tables with such triggers in the future, but for
 * now, CopyMultiInsertInfoFlush expects that any before row insert
 * and statement level insert triggers are on the same relation.
 */
insertMethod = CIM_SINGLE;
}
else if (resultRelInfo->ri_FdwRoutine != NULL ||
 cstate->volatile_defexprs)
{
...
...

I think, if possible, all these if-else checks in CopyFrom() can be
moved to a single function which can probably be named as
IdentifyCopyInsertMethod() and this function can be called in
IsParallelCopyAllowed(). This will ensure that in case of Parallel
Copy when the leader has performed all these checks, the worker won't
do it again. I also feel that it will make the code look a bit
cleaner.

--

+void
+ParallelCopyMain(dsm_segment *seg, shm_toc *toc)
+{
...
...
+   InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+                         &walusage[ParallelWorkerNumber]);
+
+   MemoryContextSwitchTo(oldcontext);
+   pfree(cstate);
+   return;
+}

It seems like you also need to delete the memory context
(cstate->copycontext) here.

--

+void
+ExecBeforeStmtTrigger(CopyState cstate)
+{
+   EState *estate = CreateExecutorState();
+   ResultRelInfo *resultRelInfo;

This function has a lot of comments which have been copied as-is
from the CopyFrom function. I think it would be good to remove those
comments from here and mention that the code in this function has been
taken from the CopyFrom function; anyone with questions can then refer
to CopyFrom. This will again avoid unnecessary code in the patch.

--

As Heikki rightly pointed out in his previous email, we need some high
level description of how Parallel Copy works somewhere in
copyparallel.c file. For reference, please see how a brief description
about parallel vacuum has been added in the vacuumlazy.c file.

 * Lazy vacuum supports parallel execution with parallel worker processes.  In
 * a parallel vacuum, we perform both index vacuum and index cleanup with
 * parallel worker processes.

Re: Parallel copy

2020-10-23 Thread Heikki Linnakangas
I had a brief look at at this patch. Important work! A couple of first 
impressions:


1. The split between patches 
0002-Framework-for-leader-worker-in-parallel-copy.patch and 
0003-Allow-copy-from-command-to-process-data-from-file.patch is quite 
artificial. All the stuff introduced in the first is unused until the 
second patch is applied. The first patch introduces a forward 
declaration for ParallelCopyData(), but the function only comes in the 
second patch. The comments in the first patch talk about 
LINE_LEADER_POPULATING and LINE_LEADER_POPULATED, but the enum only 
comes in the second patch. I think these have to merged into one. If you 
want to split it somehow, I'd suggest having a separate patch just to 
move CopyStateData from copy.c to copy.h. The subsequent patch would 
then be easier to read as you could see more easily what's being added 
to CopyStateData. Actually I think it would be better to have a new 
header file, copy_internal.h, to hold CopyStateData and the other 
structs, and keep copy.h as it is.


2. This desperately needs some kind of a high-level overview of how it 
works. What is a leader, what is a worker? Which process does each step 
of COPY processing, like reading from the file/socket, splitting the 
input into lines, handling escapes, calling input functions, and 
updating the heap and indexes? What data structures are used for the 
communication? How is the work synchronized between the processes? 
There are comments on those individual aspects scattered in the patch, 
but if you're not already familiar with it, you don't know where to 
start. There's some of that in the commit message, but it needs to be 
somewhere in the source code, maybe in a long comment at the top of 
copyparallel.c.


3. I'm surprised there's a separate ParallelCopyLineBoundary struct for 
every input line. Doesn't that incur a lot of synchronization overhead? 
I haven't done any testing, this is just my gut feeling, but I assumed 
you'd work in batches of, say, 100 or 1000 lines each.


- Heikki




Re: Parallel copy

2020-10-21 Thread Bharath Rupireddy
On Wed, Oct 21, 2020 at 3:18 PM Bharath Rupireddy
 wrote:
>
> 17. Remove extra lines after #define IsHeaderLine()
> (cstate->header_line && cstate->cur_lineno == 1) in copy.h
>

 I missed one comment:

 18. I think we need to treat the number of parallel workers as an
integer similar to the parallel option in vacuum.

postgres=# copy t1 from stdin with(parallel '1');  <-- we
should not allow this.
Enter data to be copied followed by a newline.

postgres=# vacuum (parallel '1') t1;
ERROR:  parallel requires an integer value


With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-21 Thread Amit Kapila
On Wed, Oct 21, 2020 at 3:19 PM Bharath Rupireddy
 wrote:
>
>
> 9. Instead of calling CopyStringToSharedMemory() for each string
> variable, can't we just create a linked list of all the strings that
> need to be copied into shm and call CopyStringToSharedMemory() only
> once? We could avoid 5 function calls?
>

If we want to avoid different function calls then can't we just store
all these strings in a local structure and use it? That might improve
the other parts of code as well where we are using these as individual
parameters.

> 10. Similar to above comment: can we fill all the required
> cstate->variables inside the function CopyNodeFromSharedMemory() and
> call it only once? In each worker we could save overhead of 5 function
> calls.
>

Yeah, that makes sense.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-21 Thread Bharath Rupireddy
Hi Vignesh,

I took a look at the v8 patch set. Here are some comments:

1. PopulateCommonCstateInfo() -- can we use PopulateCommonCStateInfo()
or PopulateCopyStateInfo()? And also EstimateCstateSize() --
EstimateCStateSize(), PopulateCstateCatalogInfo() --
PopulateCStateCatalogInfo()?

2. Instead of mentioning numbers like 1024, 64K, 10240 in the
comments, can we represent them in terms of macros?
/* It can hold 1024 blocks of 64K data in DSM to be processed by the worker. */
#define MAX_BLOCKS_COUNT 1024
/*
 * It can hold up to 10240 records' information for workers to process. RINGSIZE

3. How about
"
Each worker at once will pick the WORKER_CHUNK_COUNT records from the
DSM data blocks and store them in it's local memory.
This is to make workers not contend much while getting record
information from the DSM. Read RINGSIZE comments before
 changing this value.
"
instead of
/*
 * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
 * block to process to avoid lock contention. Read RINGSIZE comments before
 * changing this value.
 */

4.  How about one line gap before and after for comments: "Leader
should operate in the following order:" and "Worker should operate in
the following order:"

5. Can we move RAW_BUF_BYTES macro definition to the beginning of the
copy.h where all the macro are defined?

6. I don't think we need the change in toast_internals.c with the
temporary hack Assert(!(IsParallelWorker() && !currentCommandIdUsed));
in GetCurrentCommandId()

7. I think
/* Can't perform copy in parallel */
if (parallel_workers <= 0)
return NULL;
can be
/* Can't perform copy in parallel */
if (parallel_workers == 0)
return NULL;
as parallel_workers can never be < 0 since we enter BeginParallelCopy
only if cstate->nworkers > 0 and also we are not allowed to have
negative values for max_worker_processes.

8. Do we want to pfree(cstate->pcdata) in case we failed to start any
parallel workers? We would have allocated a good amount of memory for
it.
{
/*
 * Reset nworkers to -1 here. This is useful in cases where user
 * specifies parallel workers, but, no worker is picked up, so go
 * back to non parallel mode value of nworkers.
 */
cstate->nworkers = -1;
*processed = CopyFrom(cstate);/* copy from file to database */
}

9. Instead of calling CopyStringToSharedMemory() for each string
variable, can't we just create a linked list of all the strings that
need to be copied into shm and call CopyStringToSharedMemory() only
once? We could avoid 5 function calls?

10. Similar to above comment: can we fill all the required
cstate->variables inside the function CopyNodeFromSharedMemory() and
call it only once? In each worker we could save overhead of 5 function
calls.

11. Looks like CopyStringFromSharedMemory() and
CopyNodeFromSharedMemory() do almost the same things except
stringToNode() and pfree(destptr). Can we have a generic function
CopyFromSharedMemory() or something else and handle with flag "bool
isnode" to differentiate the two use cases?

12. Can we move below check to the end in IsParallelCopyAllowed()?
/* Check parallel safety of the trigger functions. */
if (cstate->rel->trigdesc != NULL &&
!CheckRelTrigFunParallelSafety(cstate->rel->trigdesc))
return false;

13. CacheLineInfo(): Instead of goto empty_data_line_update; how about
having this directly inside the if block as it's being used only once?

14. GetWorkerLine(): How about avoiding the goto statements and replacing
the common code with a static inline function or a macro?

15. UpdateSharedLineInfo(): Below line is misaligned.
lineInfo->first_block = blk_pos;
lineInfo->start_offset = offset;

16. ParallelCopyFrom(): Do we need CHECK_FOR_INTERRUPTS(); at the
start of  for (;;)?

17. Remove extra lines after #define IsHeaderLine()
(cstate->header_line && cstate->cur_lineno == 1) in copy.h

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-21 Thread vignesh C
On Thu, Oct 8, 2020 at 11:15 AM vignesh C  wrote:
>
> I'm summarizing the pending open points so that I don't miss anything:
> 1) Performance test on latest patch set.

It is tested and the results are shared by Bharath at [1]

> 2) Testing points suggested.

Tests are added as suggested and the details are shared by Bharath at [1]

> 3) Support of parallel copy for COPY_OLD_FE.

It is handled as part of v8 patch shared at [2]

> 4) Worker has to hop through all the processed chunks before getting
> the chunk which it can process.

Open

> 5) Handling of Tomas's comments.

I have fixed and updated the fix details as part of [3]

> 6) Handling of Greg's comments.

I have fixed and updated the fix details as part of [4]

Except for "4) Worker has to hop through all the processed chunks before
getting the chunk which it can process", all open tasks are handled. I will
work on this and provide an update shortly.

[1]
https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/caldanm2ucmcmozcbkl8b7az9oyd9hz+fndczhssiiqj4v-x...@mail.gmail.com
[3]
https://www.postgresql.org/message-id/CALDaNm0_zUa9%2BS%3DpwCz3Yp43SY3r9bnO4v-9ucXUujEE%3D0Sd7g%40mail.gmail.com
[4]
https://www.postgresql.org/message-id/CALDaNm31pGG%2BL9N4HbM0mO4iuceih4mJ5s87jEwOPaFLpmDKyQ%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-10-20 Thread Bharath Rupireddy
On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:
>
> On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila 
wrote:
> >
> > 2. Do we have tests for toast tables? I think if you implement the
> > previous point some existing tests might cover it but I feel we should
> > have at least one or two tests for the same.
> >
> Toast table use case 1: 1 tuples, 9.6GB data, 3 indexes 2 on integer
columns, 1 on text column(not the toast column), csv file, each row is >
1320KB:
> (222.767, 0, 1X), (134.171, 1, 1.66X), (93.749, 2, 2.38X), (93.672, 4,
2.38X), (94.827, 8, 2.35X), (93.766, 16, 2.37X), (98.153, 20, 2.27X),
(122.721, 30, 1.81X)
>
> Toast table use case 2: 10 tuples, 96GB data, 3 indexes 2 on integer
columns, 1 on text column(not the toast column), csv file, each row is >
1320KB:
> (2255.032, 0, 1X), (1358.628, 1, 1.66X), (901.170, 2, 2.5X), (912.743, 4,
2.47X), (988.718, 8, 2.28X), (938.000, 16, 2.4X), (997.556, 20, 2.26X),
(1000.586, 30, 2.25X)
>
> Toast table use case3: 1 tuples, 9.6GB, no indexes, binary file, each
row is > 1320KB:
> (136.983, 0, 1X), (136.418, 1, 1X), (81.896, 2, 1.66X), (62.929, 4,
2.16X), (52.311, 8, 2.6X), (40.032, 16, 3.49X), (44.097, 20, 3.09X),
(62.310, 30, 2.18X)
>
> In the case of a Toast table, we could achieve up to 2.5X for csv files,
and 3.5X for binary files. We are analyzing this point and will post an
update on our findings soon.
>

I analyzed the above point of getting only up to 2.5X performance
improvement for csv files with a toast table having 3 indexes - 2 on integer
columns and 1 on a text column (not the toast column). The reason is that the
workers are fast enough to do the work and end up waiting for the leader to
fill in the data blocks, and in this case the leader is already serving the
workers at its maximum possible speed. Hence most of the time the workers are
waiting, not doing any beneficial work.

Having observed the above point, I tried to make the workers perform more work
to avoid the waiting time. For this, I added a gist index on the toasted text
column. The use case and results are as follows.

Toast table use case4: 1 tuples, 9.6GB, 4 indexes - 2 on integer
columns, 1 on non-toasted text column and 1 gist index on toasted text
column, csv file, each row is  ~ 12.2KB:

(1322.839, 0, 1X), (1261.176, 1, 1.05X), (632.296, 2, 2.09X), (321.941, 4,
4.11X), (181.796, 8, 7.27X), *(105.750, 16, 12.51X)*, (107.099, 20,
12.35X), (123.262, 30, 10.73X)

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-10-19 Thread Amit Kapila
On Sun, Oct 18, 2020 at 7:47 AM Hou, Zhijie  wrote:
>
> Hi Vignesh,
>
> After having a look over the patch,
> I have some suggestions for
> 0003-Allow-copy-from-command-to-process-data-from-file.patch.
>
> 1.
>
> +static uint32
> +EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List 
> *attnamelist,
> +  char **whereClauseStr, char 
> **rangeTableStr,
> +  char **attnameListStr, char 
> **notnullListStr,
> +  char **nullListStr, char **convertListStr)
> +{
> +   uint32  strsize = 
> MAXALIGN(sizeof(SerializedParallelCopyState));
> +
> +   strsize += EstimateStringSize(cstate->null_print);
> +   strsize += EstimateStringSize(cstate->delim);
> +   strsize += EstimateStringSize(cstate->quote);
> +   strsize += EstimateStringSize(cstate->escape);
>
>
> It uses the function EstimateStringSize to get the strlen of null_print, delim,
> quote and escape.
> But the length of null_print seems to already be stored in null_print_len,
> and delim/quote/escape must be 1 byte each, so calling strlen again seems
> unnecessary.
>
> How about  " strsize += sizeof(uint32) + cstate->null_print_len + 1"
>

+1. This seems like a good suggestion but add comments for
delim/quote/escape to indicate that we are considering one-byte for
each. I think this will obviate the need of function
EstimateStringSize. Another thing in this regard is that we normally
use add_size function to compute the size but I don't see that being
used in this and nearby computation. That helps us to detect overflow
of addition if any.

EstimateCstateSize()
{
..
+
+ strsize++;
..
}

Why do we need this additional one-byte increment? Does it make sense
to add a small comment for the same?
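
For illustration, the add_size() style could look roughly like this (a sketch
assuming each string is serialized as a uint32 length word followed by its
bytes; this is not the patch's actual code):

Size strsize = MAXALIGN(sizeof(SerializedParallelCopyState));

/* null_print: length word + bytes (null_print_len + 1 includes the terminator) */
strsize = add_size(strsize, sizeof(uint32));
strsize = add_size(strsize, cstate->null_print_len + 1);

/* delim, quote and escape are one byte each, plus their terminator */
strsize = add_size(strsize, 3 * (sizeof(uint32) + 2));

add_size() ereports on overflow, so a bogus estimate cannot silently wrap.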

> 2.
> +   strsize += EstimateNodeSize(cstate->whereClause, whereClauseStr);
>
> +   copiedsize += CopyStringToSharedMemory(cstate, whereClauseStr,
> +  shmptr + copiedsize);
>
> Some string lengths are counted twice.
> The 'whereClauseStr' has strlen called on it once in EstimateNodeSize and again
> in CopyStringToSharedMemory.
> I don't know whether it's worth refactoring the code to avoid the duplicate
> strlen. What do you think?
>

It doesn't seem worth it to me. We would probably need to use additional
variables to save those lengths. I think it will add more
code/complexity than it will save. See EstimateParamListSpace and
SerializeParamList where we get the typeLen each time; that way the code
looks neat to me, and we aren't going to save much by not following
a similar thing here.

-- 
With Regards,
Amit Kapila.




RE: Parallel copy

2020-10-17 Thread Hou, Zhijie
Hi Vignesh,

After having a look over the patch,
I have some suggestions for 
0003-Allow-copy-from-command-to-process-data-from-file.patch.

1.

+static uint32
+EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist,
+  char **whereClauseStr, char **rangeTableStr,
+  char **attnameListStr, char **notnullListStr,
+  char **nullListStr, char **convertListStr)
+{
+   uint32  strsize = MAXALIGN(sizeof(SerializedParallelCopyState));
+
+   strsize += EstimateStringSize(cstate->null_print);
+   strsize += EstimateStringSize(cstate->delim);
+   strsize += EstimateStringSize(cstate->quote);
+   strsize += EstimateStringSize(cstate->escape);


It uses the function EstimateStringSize to get the strlen of null_print, delim,
quote and escape.
But the length of null_print seems to already be stored in null_print_len,
and delim/quote/escape must be 1 byte each, so calling strlen again seems
unnecessary.

How about  " strsize += sizeof(uint32) + cstate->null_print_len + 1"

2.
+   strsize += EstimateNodeSize(cstate->whereClause, whereClauseStr);

+   copiedsize += CopyStringToSharedMemory(cstate, whereClauseStr,
+  shmptr + copiedsize);

Some string lengths are counted twice.
The 'whereClauseStr' has strlen called on it once in EstimateNodeSize and again
in CopyStringToSharedMemory.
I don't know whether it's worth refactoring the code to avoid the duplicate
strlen. What do you think?

Best regards,
houzj



Re: Parallel copy

2020-10-15 Thread Amit Kapila
On Wed, Oct 14, 2020 at 6:51 PM vignesh C  wrote:
>
> On Fri, Oct 9, 2020 at 11:01 AM Amit Kapila  wrote:
> >
> > I am not able to properly parse the data but If understand the wal
> > data for non-parallel (1116 |   0 |   3587203) and parallel (1119
> > |   6 |   3624405) case doesn't seem to be the same. Is that
> > right? If so, why? Please ensure that no checkpoint happens for both
> > cases.
> >
>
> I have disabled checkpoint, the results with the checkpoint disabled
> are given below:
>                         | wal_records | wal_fpi | wal_bytes
> Sequential Copy         | 1116        | 0       | 3587669
> Parallel Copy(1 worker) | 1116        | 0       | 3587669
> Parallel Copy(4 worker) | 1121        | 0       | 3587668
> I noticed that for 1 worker, wal_records & wal_bytes are the same as for
> sequential copy, but with a higher worker count there is a difference in
> wal_records & wal_bytes. I think the difference should be ok because with
> more than 1 worker the order in which the records are processed differs,
> based on which worker picks which records to process from the input file.
> In the case of sequential copy/1 worker the records are always processed
> in the same order, hence wal_bytes are the same.
>

Are all records of the same size in your test? If so, then why the
order should matter? Also, even the number of wal_records has
increased but wal_bytes are not increased, rather it is one-byte less.
Can we identify what is going on here? I don't intend to say that it
is a problem but we should know the reason clearly.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-14 Thread vignesh C
On Thu, Oct 8, 2020 at 8:43 AM Greg Nancarrow  wrote:
>
> On Thu, Oct 8, 2020 at 5:44 AM vignesh C  wrote:
>
> > Attached v6 patch with the fixes.
> >
>
> Hi Vignesh,
>
> I noticed a couple of issues when scanning the code in the following
patch:
>
> v6-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> In the following code, it will put a junk uint16 value into *destptr
> (and thus may well cause a crash) on a Big Endian architecture
> (Solaris Sparc, s390x, etc.):
> You're storing a (uint16) string length in a uint32 and then pulling
> out the lower two bytes of the uint32 and copying them into the
> location pointed to by destptr.
>
>
> static void
> +CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
> + uint32 *copiedsize)
> +{
> + uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
> +
> + memcpy(destptr, (uint16 *) &len, sizeof(uint16));
> + *copiedsize += sizeof(uint16);
> + if (len)
> + {
> + memcpy(destptr + sizeof(uint16), srcPtr, len);
> + *copiedsize += len;
> + }
> +}
>
> I suggest you change the code to:
>
> uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
> memcpy(destptr, &len, sizeof(uint16));
>
> [I assume string length here can't ever exceed (65535 - 1), right?]
>
> Looking a bit deeper into this, I'm wondering if in fact your
> EstimateStringSize() and EstimateNodeSize() functions should be using
> BUFFERALIGN() for EACH stored string/node (rather than just calling
> shm_toc_estimate_chunk() once at the end, after the length of packed
> strings and nodes has been estimated), to ensure alignment of start of
> each string/node. Other Postgres code appears to be aligning each
> stored chunk using shm_toc_estimate_chunk(). See the definition of
> that macro and its current usages.
>

I'm not handling this; it is similar to how it is handled in other places.

> Then you could safely use:
>
> uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
> *(uint16 *)destptr = len;
> *copiedsize += sizeof(uint16);
> if (len)
> {
> memcpy(destptr + sizeof(uint16), srcPtr, len);
> *copiedsize += len;
> }
>
> and in the CopyStringFromSharedMemory() function, then could safely use:
>
> len = *(uint16 *)srcPtr;
>
> The compiler may be smart enough to optimize-away the memcpy() in this
> case anyway, but there are issues in doing this for architectures that
> take a performance hit for unaligned access, or don't support
> unaligned access.

Changed it to uint32 so that there are no issues in case the length exceeds
65535, and also to avoid problems on Big Endian architectures.

> Also, in CopyFromSharedMemory() functions, you should use palloc()
> instead of palloc0(), as you're filling the entire palloc'd buffer
> anyway, so no need to ask for additional MemSet() of all buffer bytes
> to 0 prior to memcpy().
>

I have changed palloc0 to palloc.

Thanks Greg for reviewing & providing your comments. These changes are
fixed in one of my earlier mail [1] that I sent.
[1]
https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-10-14 Thread vignesh C
On Sat, Oct 3, 2020 at 6:20 AM Tomas Vondra 
wrote:
>
> Hello Vignesh,
>
> I've done some basic benchmarking on the v4 version of the patches (but
> AFAIKC the v5 should perform about the same), and some initial review.
>
> For the benchmarking, I used the lineitem table from TPC-H - for 75GB
> data set, this largest table is about 64GB once loaded, with another
> 54GB in 5 indexes. This is on a server with 32 cores, 64GB of RAM and
> NVME storage.
>
> The COPY duration with varying number of workers (specified using the
> parallel COPY option) looks like this:
>
>    workers    duration
>    -------    --------
>          0        1366
>          1        1255
>          2         704
>          3         526
>          4         434
>          5         385
>          6         347
>          7         322
>          8         327
>
> So this seems to work pretty well - initially we get almost linear
> speedup, then it slows down (likely due to contention for locks, I/O
> etc.). Not bad.

Thanks for testing with different workers & posting the results.

> I've only done a quick review, but overall the patch looks in fairly
> good shape.
>
> 1) I don't quite understand why we need INCREMENTPROCESSED and
> RETURNPROCESSED, considering they just do ++ or return. It just
> obfuscates the code, I think.
>

I have removed the macros.

> 2) I find it somewhat strange that BeginParallelCopy can just decide not
> to do parallel copy after all. Why not make this decision in the
> caller? Or maybe it's fine this way, not sure.
>

I have moved the check IsParallelCopyAllowed to the caller.
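
With that, the caller-side flow reads roughly like this (a simplified
sketch with argument lists elided; the names follow the patch but the
exact control flow here is illustrative):

if (cstate->nworkers > 0 && IsParallelCopyAllowed(cstate))
    pcxt = BeginParallelCopy(cstate, ...);  /* may still return NULL if no workers start */

if (pcxt != NULL)
    ParallelCopyFrom(cstate);               /* leader + workers */
else
    *processed = CopyFrom(cstate);          /* serial fallback */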

> 3) AFAIK we don't modify typedefs.list in patches, so these changes
> should be removed.
>

I have seen typedefs.list being changed in many commits, and it also helps
in running pgindent, so I'm retaining this change.

> 4) IsTriggerFunctionParallelSafe actually checks all triggers, not just
> one, so the comment needs minor rewording.
>

Modified the comments.

Thanks for the comments & sharing the test results Tomas, These changes are
fixed in one of my earlier mail [1] that I sent.

[1]
https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-10-14 Thread vignesh C
> > I have verified it using:
> > postgres=# select * from pg_stat_statements where query like '%copy%';
> > [wide output condensed to the relevant non-zero columns]
> >
> > query = copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv'
> > with(format csv, delimiter ',')
> > calls = 1, total_exec_time = 265.195105, rows = 175000
> > shared_blks_hit = 1916, shared_blks_dirtied = 946, shared_blks_written = 946
> > wal_records = 1116, wal_fpi = 0, wal_bytes = 3587203
> >
> > query = copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv'
> > with(format csv, delimiter ',', parallel '2')
> > calls = 1, total_exec_time = 35668.402482, rows = 175000
> > shared_blks_hit = 3101, shared_blks_read = 36, shared_blks_dirtied = 952,
> > shared_blks_written = 919
> > wal_records = 1119, wal_fpi = 6, wal_bytes = 3624405
> > (2 rows)
> >
>
> I am not able to properly parse the data but If understand the wal
> data for non-parallel (1116 |   0 |   3587203) and parallel (1119
> |   6 |   3624405) case doesn't seem to be the same. Is that
> right? If so, why? Please ensure that no checkpoint happens for both
> cases.
>

I have disabled checkpoint, the results with the checkpoint disabled
are given below:
                        | wal_records | wal_fpi | wal_bytes
Sequential Copy         | 1116        | 0       | 3587669
Parallel Copy(1 worker) | 1116        | 0       | 3587669
Parallel Copy(4 worker) | 1121        | 0       | 3587668
I noticed that for 1 worker, wal_records & wal_bytes are the same as for
sequential copy, but with a higher worker count there is a difference in
wal_records & wal_bytes. I think the difference should be ok because with
more than 1 worker the order in which the records are processed differs,
based on which worker picks which records to process from the input file.
In the case of sequential copy/1 worker the records are always processed
in the same order, hence wal_bytes are the same.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-14 Thread vignesh C
On Fri, Oct 9, 2020 at 10:42 AM Amit Kapila  wrote:
>
> On Thu, Oct 8, 2020 at 12:14 AM vignesh C  wrote:
> >
> > On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila 
wrote:
> > >
> > >
> > > I am convinced by the reason given by Kyotaro-San in that another
> > > thread [1] and performance data shown by Peter that this can't be an
> > > independent improvement and rather in some cases it can do harm. Now,
> > > if you need it for a parallel-copy path then we can change it
> > > specifically to the parallel-copy code path but I don't understand
> > > your reason completely.
> > >
> >
> > Whenever we need data to be populated, we will get a new data block &
> > pass it to CopyGetData to populate the data. In case of file copy, the
> > server will completely fill the data block; a partially filled block is
> > returned only on EOF or when no more data is available. But in case of
> > STDIN copy, even though there is 8K of data available in the data block &
> > 8K of data available in STDIN, CopyGetData will return as soon as the
> > libpq buffer holds more than minread bytes. We pass a new data block
> > every time to load data, so each time we pass an 8K data block
> > CopyGetData may load only a few bytes into it & return. I wanted to keep
> > the same data population logic for both file copy & STDIN copy, i.e.
> > fill full 8K data blocks & then consume the populated data. There is an
> > alternative solution: I can have some special handling in the STDIN case
> > wherein the existing data block is passed along with the index from
> > where the data should be copied. Thoughts?
> >
>
> What you are proposing as an alternative solution, isn't that what we
> are doing without the patch? IIUC, you require this because of your
> corresponding changes to handle COPY_NEW_FE in CopyReadLine(), is that
> right? If so, what is the difficulty in making it behave similar to
> the non-parallel case?
>

The alternate solution is similar to how the existing copy handles STDIN
copies. I have made changes in the v7 patch attached at [1] to make
parallel copy handle STDIN data similar to non-parallel copy, so the
change that prompted the original comment has been removed from the 0001
patch:
> > + if (cstate->copy_dest == COPY_NEW_FE)
> > + minread = RAW_BUF_SIZE - nbytes;
> > +
> >   inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
> > -   1, RAW_BUF_SIZE - nbytes);
> > +   minread, RAW_BUF_SIZE - nbytes);
> >
> > No comment to explain why this change is done?

[1]
https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-10-14 Thread Bharath Rupireddy
I did performance testing on v7 patch set[1] with custom
postgresql.conf[2]. The results are of the triplet form (exec time in
sec, number of workers, gain)

Use case 1: 10million rows, 5.2GB data, 2 indexes on integer columns,
1 index on text column, binary file
(1104.898, 0, 1X), (1112.221, 1, 1X), (640.236, 2, 1.72X), (335.090,
4, 3.3X), (200.492, 8, 5.51X), (131.448, 16, 8.4X), (121.832, 20,
9.1X), (124.287, 30, 8.9X)

Use case 2: 10million rows, 5.2GB data,2 indexes on integer columns, 1
index on text column, copy from stdin, csv format
(1203.282, 0, 1X), (1135.517, 1, 1.06X), (655.140, 2, 1.84X),
(343.688, 4, 3.5X), (203.742, 8, 5.9X), (144.793, 16, 8.31X),
(133.339, 20, 9.02X), (136.672, 30, 8.8X)

Use case 3: 10million rows, 5.2GB data,2 indexes on integer columns, 1
index on text column, text file
(1165.991, 0, 1X), (1128.599, 1, 1.03X), (644.793, 2, 1.81X),
(342.813, 4, 3.4X), (204.279, 8, 5.71X), (139.986, 16, 8.33X),
(128.259, 20, 9.1X), (132.764, 30, 8.78X)

Above results are similar to the results with earlier versions of the patch set.

On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila  wrote:
>
> Sure, you need to change the code such that when force_parallel_mode =
> 'regress' is specified then it always uses one worker. This is
> primarily for testing purposes and will help during the development of
> this patch as it will make all existing Copy tests to use quite a good
> portion of the parallel infrastructure.
>

I performed force_parallel_mode = regress testing and found 2 issues,
the fixes for the same are available in v7 patch set[1].

>
> > Overall, we have below test cases to cover the code and for performance 
> > measurements. We plan to run these tests whenever a new set of patches is 
> > posted.
> >
> > 1. csv
> > 2. binary
>
> Don't we need the tests for plain text files as well?
>

I added a text use case and above mentioned are perf results on v7 patch set[1].

>
> > 3. force parallel mode = regress
> > 4. toast data csv and binary
> > 5. foreign key check, before row, after row, before statement, after 
> > statement, instead of triggers
> > 6. partition case
> > 7. foreign partitions and partitions having trigger cases
> > 8. where clause having parallel unsafe and safe expression, default 
> > parallel unsafe and safe expression
> > 9. temp, global, local, unlogged, inherited tables cases, foreign tables
> >
>
> Sounds like good coverage. So, are you doing all this testing
> manually? How are you maintaining these tests?
>

All test cases listed above, except for the cases that are meant to
measure perf gain with huge data, are present in v7-0005 patch in v7
patch set[1].

[1] 
https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com

[2]
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-09 Thread Amit Kapila
On Fri, Oct 9, 2020 at 3:50 PM Bharath Rupireddy
 wrote:
>
> On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila  wrote:
> >
> > On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy
> >  wrote:
> > >
> > > On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila  
> > > wrote:
> > > >
> > > > From the testing perspective,
> > > > 1. Test by having something force_parallel_mode = regress which means
> > > > that all existing Copy tests in the regression will be executed via
> > > > new worker code. You can have this as a test-only patch for now and
> > > > make sure all existing tests passed with this.
> > > >
> > >
> > > I don't think all the existing copy test cases(except the new test cases 
> > > added in the parallel copy patch set) would run inside the parallel 
> > > worker if force_parallel_mode is on. This is because, the parallelism 
> > > will be picked up for parallel copy only if parallel option is specified 
> > > unlike parallelism for select queries.
> > >
> >
> > Sure, you need to change the code such that when force_parallel_mode =
> > 'regress' is specified then it always uses one worker. This is
> > primarily for testing purposes and will help during the development of
> > this patch as it will make all existing Copy tests to use quite a good
> > portion of the parallel infrastructure.
> >
>
> IIUC, firstly, I will set force_parallel_mode = FORCE_PARALLEL_REGRESS
> as default value in guc.c,
>

No need to set this as the default value. You can change it in
postgresql.conf before running tests.

> and then adjust the parallelism related
> code in copy.c such that it always picks 1 worker and spawns it. This
> way, all the existing copy test cases would be run in parallel worker.
> Please let me know if this is okay.
>

Yeah, this sounds fine.

> If yes, I will do this and update
> here.
>

Okay, thanks, but ensure the difference in test execution before and
after your change. After your change, all the 'copy' tests should
invoke the worker to perform a copy.

> >
> > > All the above tests are performed on the latest v6 patch set (attached 
> > > here in this thread) with custom postgresql.conf[1]. The results are of 
> > > the triplet form (exec time in sec, number of workers, gain)
> > >
> >
> > Okay, so I am assuming the performance is the same as we have seen
> > with the earlier versions of patches.
> >
>
> Yes. Most recent run on v5 patch set [1]
>

Okay, good to know that.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-09 Thread Bharath Rupireddy
On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila  wrote:
>
> On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy
>  wrote:
> >
> > On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila  wrote:
> > >
> > > From the testing perspective,
> > > 1. Test by having something force_parallel_mode = regress which means
> > > that all existing Copy tests in the regression will be executed via
> > > new worker code. You can have this as a test-only patch for now and
> > > make sure all existing tests passed with this.
> > >
> >
> > I don't think all the existing copy test cases(except the new test cases 
> > added in the parallel copy patch set) would run inside the parallel worker 
> > if force_parallel_mode is on. This is because, the parallelism will be 
> > picked up for parallel copy only if parallel option is specified unlike 
> > parallelism for select queries.
> >
>
> Sure, you need to change the code such that when force_parallel_mode =
> 'regress' is specified then it always uses one worker. This is
> primarily for testing purposes and will help during the development of
> this patch as it will make all existing Copy tests to use quite a good
> portion of the parallel infrastructure.
>

IIUC, firstly, I will set force_parallel_mode = FORCE_PARALLEL_REGRESS
as default value in guc.c, and then adjust the parallelism related
code in copy.c such that it always picks 1 worker and spawns it. This
way, all the existing copy test cases would be run in parallel worker.
Please let me know if this is okay. If yes, I will do this and update
here.
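
In code terms, the tweak could be as small as something like this in the
worker-count selection logic (a sketch; the exact placement in copy.c is
illustrative):

/* Test-only tweak: force one worker so every existing COPY regression test
 * exercises the parallel copy infrastructure. force_parallel_mode and
 * FORCE_PARALLEL_REGRESS are the existing optimizer GUC symbols. */
if (force_parallel_mode == FORCE_PARALLEL_REGRESS)
    parallel_workers = 1;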

>
> > All the above tests are performed on the latest v6 patch set (attached here 
> > in this thread) with custom postgresql.conf[1]. The results are of the 
> > triplet form (exec time in sec, number of workers, gain)
> >
>
> Okay, so I am assuming the performance is the same as we have seen
> with the earlier versions of patches.
>

Yes. Most recent run on v5 patch set [1]

>
> > Overall, we have below test cases to cover the code and for performance 
> > measurements. We plan to run these tests whenever a new set of patches is 
> > posted.
> >
> > 1. csv
> > 2. binary
>
> Don't we need the tests for plain text files as well?
>

Will add one.

>
> > 3. force parallel mode = regress
> > 4. toast data csv and binary
> > 5. foreign key check, before row, after row, before statement, after 
> > statement, instead of triggers
> > 6. partition case
> > 7. foreign partitions and partitions having trigger cases
> > 8. where clause having parallel unsafe and safe expression, default 
> > parallel unsafe and safe expression
> > 9. temp, global, local, unlogged, inherited tables cases, foreign tables
> >
>
> Sounds like good coverage. So, are you doing all this testing
> manually? How are you maintaining these tests?
>

Yes, running them manually. A few of the tests (1, 2, 4) require huge
datasets for performance measurements, and the other test cases are to
ensure we don't choose parallelism when we shouldn't. We will try to add
the test cases that are not meant for performance to the patch's tests.

[1] - 
https://www.postgresql.org/message-id/CALj2ACW%3Djm5ri%2B7rXiQaFT_c5h2rVS%3DcJOQVFR5R%2Bbowt3QDkw%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-09 Thread Amit Kapila
On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy
 wrote:
>
> On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila  wrote:
> >
> > From the testing perspective,
> > 1. Test by having something force_parallel_mode = regress which means
> > that all existing Copy tests in the regression will be executed via
> > new worker code. You can have this as a test-only patch for now and
> > make sure all existing tests passed with this.
> >
>
> I don't think all the existing copy test cases(except the new test cases 
> added in the parallel copy patch set) would run inside the parallel worker if 
> force_parallel_mode is on. This is because, the parallelism will be picked up 
> for parallel copy only if parallel option is specified unlike parallelism for 
> select queries.
>

Sure, you need to change the code such that when force_parallel_mode =
'regress' is specified then it always uses one worker. This is
primarily for testing purposes and will help during the development of
this patch as it will make all existing Copy tests to use quite a good
portion of the parallel infrastructure.

>
> All the above tests are performed on the latest v6 patch set (attached here 
> in this thread) with custom postgresql.conf[1]. The results are of the 
> triplet form (exec time in sec, number of workers, gain)
>

Okay, so I am assuming the performance is the same as we have seen
with the earlier versions of patches.

> Overall, we have below test cases to cover the code and for performance 
> measurements. We plan to run these tests whenever a new set of patches is 
> posted.
>
> 1. csv
> 2. binary

Don't we need the tests for plain text files as well?

> 3. force parallel mode = regress
> 4. toast data csv and binary
> 5. foreign key check, before row, after row, before statement, after 
> statement, instead of triggers
> 6. partition case
> 7. foreign partitions and partitions having trigger cases
> 8. where clause having parallel unsafe and safe expression, default parallel 
> unsafe and safe expression
> 9. temp, global, local, unlogged, inherited tables cases, foreign tables
>

Sounds like good coverage. So, are you doing all this testing
manually? How are you maintaining these tests?

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-09 Thread Greg Nancarrow
On Fri, Oct 9, 2020 at 5:40 PM Amit Kapila  wrote:
>
> > Looking a bit deeper into this, I'm wondering if in fact your
> > EstimateStringSize() and EstimateNodeSize() functions should be using
> > BUFFERALIGN() for EACH stored string/node (rather than just calling
> > shm_toc_estimate_chunk() once at the end, after the length of packed
> > strings and nodes has been estimated), to ensure alignment of start of
> > each string/node. Other Postgres code appears to be aligning each
> > stored chunk using shm_toc_estimate_chunk(). See the definition of
> > that macro and its current usages.
> >
>
> I am not sure if this required for the purpose of correctness. AFAIU,
> we do store/estimate multiple parameters in same way at other places,
> see EstimateParamListSpace and SerializeParamList. Do you have
> something else in mind?
>

The point I was trying to make is that potentially more efficient code
can be used if the individual strings/nodes are aligned, rather than
packed (as they are now), but as you point out, there are already
cases (e.g. SerializeParamList) where within the separately-aligned
chunks the data is not aligned, so maybe not a big deal. Oh well,
without alignment, that means use of memcpy() cannot really be avoided
here for serializing/de-serializing ints etc., let's hope the compiler
optimizes it as best it can.
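
In other words, the portable pattern for reading from the packed layout
stays memcpy()-based on both sides, e.g. (illustrative):

uint32 len;

/* Unaligned-safe: legal on every architecture; compilers typically fold
 * this memcpy() into a single load where unaligned access is cheap. */
memcpy(&len, srcPtr, sizeof(uint32));

/* The cast form is only safe when srcPtr is known to be suitably aligned:
 *     len = *(uint32 *) srcPtr;
 */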

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Parallel copy

2020-10-09 Thread Amit Kapila
On Thu, Oct 8, 2020 at 8:43 AM Greg Nancarrow  wrote:
>
> On Thu, Oct 8, 2020 at 5:44 AM vignesh C  wrote:
>
> > Attached v6 patch with the fixes.
> >
>
> Hi Vignesh,
>
> I noticed a couple of issues when scanning the code in the following patch:
>
> v6-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> In the following code, it will put a junk uint16 value into *destptr
> (and thus may well cause a crash) on a Big Endian architecture
> (Solaris Sparc, s390x, etc.):
> You're storing a (uint16) string length in a uint32 and then pulling
> out the lower two bytes of the uint32 and copying them into the
> location pointed to by destptr.
>
>
> static void
> +CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
> + uint32 *copiedsize)
> +{
> + uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
> +
> + memcpy(destptr, (uint16 *) &len, sizeof(uint16));
> + *copiedsize += sizeof(uint16);
> + if (len)
> + {
> + memcpy(destptr + sizeof(uint16), srcPtr, len);
> + *copiedsize += len;
> + }
> +}
>
> I suggest you change the code to:
>
> uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
> memcpy(destptr, &len, sizeof(uint16));
>
> [I assume string length here can't ever exceed (65535 - 1), right?]
>

Your suggestion makes sense to me if the assumption related to string
length is correct. If we can't ensure that then we need to probably
use four bytes uint32 to store the length.

> Looking a bit deeper into this, I'm wondering if in fact your
> EstimateStringSize() and EstimateNodeSize() functions should be using
> BUFFERALIGN() for EACH stored string/node (rather than just calling
> shm_toc_estimate_chunk() once at the end, after the length of packed
> strings and nodes has been estimated), to ensure alignment of start of
> each string/node. Other Postgres code appears to be aligning each
> stored chunk using shm_toc_estimate_chunk(). See the definition of
> that macro and its current usages.
>

I am not sure if this required for the purpose of correctness. AFAIU,
we do store/estimate multiple parameters in same way at other places,
see EstimateParamListSpace and SerializeParamList. Do you have
something else in mind?

While looking at the latest code, I observed below issue in patch
v6-0003-Allow-copy-from-command-to-process-data-from-file:

+ /* Estimate the size for shared information for PARALLEL_COPY_KEY_CSTATE */
+ est_cstateshared = MAXALIGN(sizeof(SerializedParallelCopyState));
+ shm_toc_estimate_chunk(&pcxt->estimator, est_cstateshared);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ strsize = EstimateCstateSize(pcxt, cstate, attnamelist, &whereClauseStr,
+ &rangeTableStr, &attnameListStr,
+ &notnullListStr, &nullListStr,
+ &convertListStr);

Here, do we need to separately estimate the size of
SerializedParallelCopyState when it is also done in
EstimateCstateSize?

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-08 Thread Amit Kapila
On Thu, Oct 8, 2020 at 12:14 AM vignesh C  wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila  wrote:
> > > > + */
> > > > +typedef struct ParallelCopyLineBoundary
> > > >
> > > > Are we doing all this state management to avoid using locks while
> > > > processing lines?  If so, I think we can use either spinlock or LWLock
> > > > to keep the main patch simple and then provide a later patch to make
> > > > it lock-less.  This will allow us to first focus on the main design of
> > > > the patch rather than trying to make this datastructure processing
> > > > lock-less in the best possible way.
> > > >
> > >
> > > The steps will be more or less same if we use spinlock too. step 1, step 
> > > 3 & step 4 will be common we have to use lock & unlock instead of step 2 
> > > & step 5. I feel we can retain the current implementation.
> > >
> >
> > I'll study this in detail and let you know my opinion on the same but
> > in the meantime, I don't follow one part of this comment: "If they
> > don't follow this order the worker might process wrong line_size and
> > leader might populate the information which worker has not yet
> > processed or in the process of processing."
> >
> > Do you want to say that leader might overwrite some information which
> > worker hasn't read yet? If so, it is not clear from the comment.
> > Another minor point about this comment:
> >
>
> Here leader and worker must follow these steps to avoid any corruption
> or hang issue. Changed it to:
>  * The leader & worker process access the shared line information by following
>  * the below steps to avoid any data corruption or hang:
>

Actually, I wanted more on the lines why such corruption or hang can
happen? It might help reviewers to understand why you have followed
such a sequence.
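
For readers following the thread: the hazard is the classic message-passing
ordering problem. A minimal sketch of the discipline involved (illustrative
only; the 'data' field stands in for the patch's block/offset bookkeeping,
and the patch's actual scheme lives in ParallelCopyLineBoundary):

/* leader: fill the payload first, then publish its size */
memcpy(lineInfo->data, line, size);
pg_write_barrier();
lineInfo->line_size = size;

/* worker: read the size first, then (and only then) the payload */
size = lineInfo->line_size;
pg_read_barrier();
/* ... now safe to read size bytes from lineInfo->data ... */

If either side skips its barrier (or the equivalent atomic-op ordering), the
worker can observe the new line_size before the payload bytes have landed,
which is exactly the "wrong line_size" corruption described above.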

> >
> > How did you ensure that this is fixed? Have you tested it, if so
> > please share the test? I see a basic problem with your fix.
> >
> > + /* Report WAL/buffer usage during parallel execution */
> > + bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
> > + walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
> > + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> > +   &walusage[ParallelWorkerNumber]);
> >
> > You need to call InstrStartParallelQuery() before the actual operation
> > starts, without that stats won't be accurate? Also, after calling
> > WaitForParallelWorkersToFinish(), you need to accumulate the stats
> > collected from workers which neither you have done nor is possible
> > with the current code in your patch because you haven't made any
> > provision to capture them in BeginParallelCopy.
> >
> > I suggest you look into lazy_parallel_vacuum_indexes() and
> > begin_parallel_vacuum() to understand how the buffer/wal usage stats
> > are accumulated. Also, please test this functionality using
> > pg_stat_statements.
> >
>
> Made changes accordingly.
> I have verified it using:
> postgres=# select * from pg_stat_statements where query like '%copy%';
> [wide output condensed to the relevant non-zero columns]
>
> query = copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv'
> with(format csv, delimiter ',')
> calls = 1, total_exec_time = 265.195105, rows = 175000
> shared_blks_hit = 1916, shared_blks_dirtied = 946, shared_blks_written = 946
> wal_records = 1116, wal_fpi = 0, wal_bytes = 3587203
>
> query = copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv'
> with(format csv, delimiter ',', parallel '2')
> calls = 1, total_exec_time = 35668.402482, rows = 175000
> shared_blks_hit = 3101, shared_blks_read = 36, shared_blks_dirtied = 952,
> shared_blks_written = 919
> wal_records = 1119, wal_fpi = 6, wal_bytes = 3624405
> (2 rows)

Re: Parallel copy

2020-10-08 Thread Amit Kapila
On Thu, Oct 8, 2020 at 12:14 AM vignesh C  wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila  wrote:
> >
> >
> > I am convinced by the reason given by Kyotaro-San in that another
> > thread [1] and performance data shown by Peter that this can't be an
> > independent improvement and rather in some cases it can do harm. Now,
> > if you need it for a parallel-copy path then we can change it
> > specifically to the parallel-copy code path but I don't understand
> > your reason completely.
> >
>
> Whenever we need data to be populated, we will get a new data block &
> pass it to CopyGetData to populate the data. In case of file copy, the
> server will completely fill the data block; a partially filled block is
> returned only on EOF or when no more data is available. But in case of
> STDIN copy, even though there is 8K of data available in the data block &
> 8K of data available in STDIN, CopyGetData will return as soon as the
> libpq buffer holds more than minread bytes. We pass a new data block
> every time to load data, so each time we pass an 8K data block
> CopyGetData may load only a few bytes into it & return. I wanted to keep
> the same data population logic for both file copy & STDIN copy, i.e.
> fill full 8K data blocks & then consume the populated data. There is an
> alternative solution: I can have some special handling in the STDIN case
> wherein the existing data block is passed along with the index from
> where the data should be copied. Thoughts?
>

What you are proposing as an alternative solution, isn't that what we
are doing without the patch? IIUC, you require this because of your
corresponding changes to handle COPY_NEW_FE in CopyReadLine(), is that
right? If so, what is the difficulty in making it behave similar to
the non-parallel case?

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-07 Thread vignesh C
On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila  wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila  wrote:
> >
> > Few additional comments:
> > ==
>
> Some more comments:
>
> v5-0002-Framework-for-leader-worker-in-parallel-copy
> ===
> 1.
> These values
> + * help in handover of multiple records with significant size of data to be
> + * processed by each of the workers to make sure there is no context
> switch & the
> + * work is fairly distributed among the workers.
>
> How about writing it as: "These values help in the handover of
> multiple records with the significant size of data to be processed by
> each of the workers. This also ensures there is no context switch and
> the work is fairly distributed among the workers."

Changed as suggested.

>
> 2. Can we keep WORKER_CHUNK_COUNT, MAX_BLOCKS_COUNT, and RINGSIZE as
> power-of-two? Say WORKER_CHUNK_COUNT as 64, MAX_BLOCK_COUNT as 1024,
> and accordingly choose RINGSIZE. At many places, we do that way. I
> think it can sometimes help in faster processing due to cache size
> requirements and in this case, I don't see a reason why we can't
> choose these values to be power-of-two. If you agree with this change
> then also do some performance testing after this change?
>

Modified as suggested. I have run a few performance tests & verified
there is no degradation. We will post a full performance run
separately in the coming days.
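
Concretely, the constants could end up like this (64 and 1024 per the
suggestion above; the RINGSIZE multiplier below is only an illustrative way
to keep that value a power-of-two as well):

#define WORKER_CHUNK_COUNT 64                /* power-of-two */
#define MAX_BLOCKS_COUNT   1024              /* power-of-two */
#define RINGSIZE (MAX_BLOCKS_COUNT * 16)     /* 16384, scales with the block count */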

> 3.
> + bool   curr_blk_completed;
> + char   data[DATA_BLOCK_SIZE]; /* data read from file */
> + uint8  skip_bytes;
> +} ParallelCopyDataBlock;
>
> Is there a reason to keep skip_bytes after data? Normally the variable
> size data is at the end of the structure. Also, there is no comment
> explaining the purpose of skip_bytes.
>

Modified as suggested and added comments.

> 4.
> + * Copy data block information.
> + * ParallelCopyDataBlock's will be created in DSM. Data read from file will 
> be
> + * copied in these DSM data blocks. The leader process identifies the records
> + * and the record information will be shared to the workers. The workers will
> + * insert the records into the table. There can be one or more number
> of records
> + * in each of the data block based on the record size.
> + */
> +typedef struct ParallelCopyDataBlock
>
> Keep one empty line after the description line like below. I also
> suggested to do a minor tweak in the above sentence which is as
> follows:
>
> * Copy data block information.
> *
> * These data blocks are created in DSM. Data read ...
>
> Try to follow a similar format in other comments as well.
>

Modified as suggested.

> 5. I think it is better to move parallelism related code to a new file
> (we can name it as copyParallel.c or something like that).
>

Modified. Added a copyparallel.c file to hold the copy parallelism
functionality; some of the function prototypes & data structures were
moved to the copy.h header file so that they can be shared between
copy.c & copyparallel.c.

> 6. copy.c(1648,25): warning C4133: 'function': incompatible types -
> from 'ParallelCopyLineState *' to 'uint32 *'
> Getting above compilation warning on Windows.
>

Modified the data type.

> v5-0003-Allow-copy-from-command-to-process-data-from-file
> ==
> 1.
> @@ -4294,7 +5047,7 @@ BeginCopyFrom(ParseState *pstate,
>   * only in text mode.
>   */
> + initStringInfo(&cstate->attribute_buf);
> - cstate->raw_buf = (char *) palloc(RAW_BUF_SIZE + 1);
> + cstate->raw_buf = (IsParallelCopy()) ? NULL : (char *)
> palloc(RAW_BUF_SIZE + 1);
>
> Is there any way IsParallelCopy can be true by this time? AFAICS, we don't do
> anything about parallelism until after this. If you want to save this
> allocation then we need to move this after we determine that
> parallelism can be used or not and accordingly the below code in the
> patch needs to be changed.
>
>  * ParallelCopyFrom - parallel copy leader's functionality.
>   *
>   * Leader executes the before statement for before statement trigger, if 
> before
> @@ -1110,8 +1547,302 @@ ParallelCopyFrom(CopyState cstate)
>   ParallelCopyShmInfo *pcshared_info = cstate->pcdata->pcshared_info;
>   ereport(DEBUG1, (errmsg("Running parallel copy leader")));
>
> + /* raw_buf is not used in parallel copy, instead data blocks are used.*/
> + pfree(cstate->raw_buf);
> + cstate->raw_buf = NULL;
>

Removed the palloc change, raw_buf will be allocated both for parallel
and non parallel copy. One other solution that I thought was to move
the memory allocation to CopyFrom, but this solution might affect fdw
w

Re: Parallel copy

2020-10-07 Thread vignesh C
On Mon, Sep 28, 2020 at 6:37 PM Ashutosh Sharma  wrote:
>
> On Mon, Sep 28, 2020 at 3:01 PM Amit Kapila  wrote:
> >
> > On Tue, Sep 22, 2020 at 2:44 PM vignesh C  wrote:
> > >
> > > Thanks Ashutosh for your comments.
> > >
> > > On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma  
> > > wrote:
> > > >
> > > > Hi Vignesh,
> > > >
> > > > I've spent some time today looking at your new set of patches and I've
> > > > some thoughts and queries which I would like to put here:
> > > >
> > > > Why are these not part of the shared cstate structure?
> > > >
> > > > SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, 
> > > > cstate->null_print);
> > > > SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
> > > > SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
> > > > SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);
> > > >
> > >
> > > I have used shared_cstate mainly to share the integer & bool data
> > > types from the leader to worker process. The above data types are of
> > > char* data type, I will not be able to use it like how I could do it
> > > for integer type. So I preferred to send these as separate keys to the
> > > worker. Thoughts?
> > >
> >
> > I think the way you have written will work but if we go with
> > Ashutosh's proposal it will look elegant and in the future, if we need
> > to share more strings as part of cstate structure then that would be
> > easier. You can probably refer to EstimateParamListSpace,
> > SerializeParamList, and RestoreParamList to see how we can share
> > different types of data in one key.
> >
>
> Yeah. And in addition to that it will also reduce the number of DSM
> keys that we need to maintain.
>

Thanks Ashutosh, This is handled as part of the v6 patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-07 Thread Greg Nancarrow
On Thu, Oct 8, 2020 at 5:44 AM vignesh C  wrote:

> Attached v6 patch with the fixes.
>

Hi Vignesh,

I noticed a couple of issues when scanning the code in the following patch:

v6-0003-Allow-copy-from-command-to-process-data-from-file.patch

In the following code, it will put a junk uint16 value into *destptr
(and thus may well cause a crash) on a Big Endian architecture
(Solaris Sparc, s390x, etc.):
You're storing a (uint16) string length in a uint32 and then pulling
out the lower two bytes of the uint32 and copying them into the
location pointed to by destptr.


static void
+CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
+ uint32 *copiedsize)
+{
+ uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
+
+ memcpy(destptr, (uint16 *) &len, sizeof(uint16));
+ *copiedsize += sizeof(uint16);
+ if (len)
+ {
+ memcpy(destptr + sizeof(uint16), srcPtr, len);
+ *copiedsize += len;
+ }
+}
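
To see the failure mode concretely (a standalone illustration, not patch
code):

uint32 len = 42;    /* bytes: 2A 00 00 00 on little-endian, 00 00 00 2A on big-endian */
uint16 stored;

memcpy(&stored, &len, sizeof(uint16));
/* little-endian: copies 2A 00, so stored == 42 (happens to work) */
/* big-endian:    copies 00 00, so stored == 0  (the length is lost) */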

I suggest you change the code to:

uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
memcpy(destptr, &len, sizeof(uint16));

[I assume string length here can't ever exceed (65535 - 1), right?]

Looking a bit deeper into this, I'm wondering if in fact your
EstimateStringSize() and EstimateNodeSize() functions should be using
BUFFERALIGN() for EACH stored string/node (rather than just calling
shm_toc_estimate_chunk() once at the end, after the length of packed
strings and nodes has been estimated), to ensure alignment of start of
each string/node. Other Postgres code appears to be aligning each
stored chunk using shm_toc_estimate_chunk(). See the definition of
that macro and its current usages.

Then you could safely use:

uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
*(uint16 *)destptr = len;
*copiedsize += sizeof(uint16);
if (len)
{
memcpy(destptr + sizeof(uint16), srcPtr, len);
*copiedsize += len;
}

and in the CopyStringFromSharedMemory() function, then could safely use:

len = *(uint16 *)srcPtr;

The compiler may be smart enough to optimize-away the memcpy() in this
case anyway, but there are issues in doing this for architectures that
take a performance hit for unaligned access, or don't support
unaligned access.

Also, in CopyFromSharedMemory() functions, you should use palloc()
instead of palloc0(), as you're filling the entire palloc'd buffer
anyway, so no need to ask for additional MemSet() of all buffer bytes
to 0 prior to memcpy().


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Parallel copy

2020-10-07 Thread vignesh C
On Mon, Sep 28, 2020 at 3:01 PM Amit Kapila  wrote:
>
> On Tue, Sep 22, 2020 at 2:44 PM vignesh C  wrote:
> >
> > Thanks Ashutosh for your comments.
> >
> > On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma  
> > wrote:
> > >
> > > Hi Vignesh,
> > >
> > > I've spent some time today looking at your new set of patches and I've
> > > some thoughts and queries which I would like to put here:
> > >
> > > Why are these not part of the shared cstate structure?
> > >
> > > SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, 
> > > cstate->null_print);
> > > SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
> > > SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
> > > SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);
> > >
> >
> > I have used shared_cstate mainly to share the integer & bool data
> > types from the leader to worker process. The above data types are of
> > char* data type, I will not be able to use it like how I could do it
> > for integer type. So I preferred to send these as separate keys to the
> > worker. Thoughts?
> >
>
> I think the way you have written will work but if we go with
> Ashutosh's proposal it will look elegant and in the future, if we need
> to share more strings as part of cstate structure then that would be
> easier. You can probably refer to EstimateParamListSpace,
> SerializeParamList, and RestoreParamList to see how we can share
> different types of data in one key.
>

Thanks for the solution Amit, I have fixed this and handled it in the
v6 patch shared in my previous mail.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-07 Thread vignesh C
On Tue, Sep 29, 2020 at 3:16 PM Greg Nancarrow  wrote:
>
> Hi Vignesh and Bharath,
>
> Seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as
> parallel-unsafe.
> Can you explain why this is?

Yes we don't need to restrict parallelism for RI_TRIGGER_PK cases as
we don't do any command counter increments while performing PK checks
as opposed to RI_TRIGGER_FK/foreign key checks. We have modified this
in the v6 patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-03 Thread Amit Kapila
On Sat, Oct 3, 2020 at 6:20 AM Tomas Vondra
 wrote:
>
> Hello Vignesh,
>
> I've done some basic benchmarking on the v4 version of the patches (but
> AFAIKC the v5 should perform about the same), and some initial review.
>
> For the benchmarking, I used the lineitem table from TPC-H - for 75GB
> data set, this largest table is about 64GB once loaded, with another
> 54GB in 5 indexes. This is on a server with 32 cores, 64GB of RAM and
> NVME storage.
>
> The COPY duration with varying number of workers (specified using the
> parallel COPY option) looks like this:
>
>    workers    duration
>    -------    --------
>          0        1366
>          1        1255
>          2         704
>          3         526
>          4         434
>          5         385
>          6         347
>          7         322
>          8         327
>
> So this seems to work pretty well - initially we get almost linear
> speedup, then it slows down (likely due to contention for locks, I/O
> etc.). Not bad.
>

+1. These numbers (> 4x speed up) look good to me.


-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-02 Thread Tomas Vondra

Hello Vignesh,

I've done some basic benchmarking on the v4 version of the patches (but
AFAIKC the v5 should perform about the same), and some initial review.

For the benchmarking, I used the lineitem table from TPC-H - for 75GB
data set, this largest table is about 64GB once loaded, with another
54GB in 5 indexes. This is on a server with 32 cores, 64GB of RAM and
NVME storage.

The COPY duration with varying number of workers (specified using the
parallel COPY option) looks like this:

   workers    duration
   -------    --------
         0        1366
         1        1255
         2         704
         3         526
         4         434
         5         385
         6         347
         7         322
         8         327

So this seems to work pretty well - initially we get almost linear
speedup, then it slows down (likely due to contention for locks, I/O
etc.). Not bad.

I've only done a quick review, but overall the patch looks in fairly
good shape.

1) I don't quite understand why we need INCREMENTPROCESSED and
RETURNPROCESSED, considering they just do ++ or return. It just
obfuscates the code, I think.

2) I find it somewhat strange that BeginParallelCopy can just decide not
to do parallel copy after all. Why not make this decision in the
caller? Or maybe it's fine this way, not sure.

3) AFAIK we don't modify typedefs.list in patches, so these changes
should be removed. 


4) IsTriggerFunctionParallelSafe actually checks all triggers, not just
one, so the comment needs minor rewording.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Parallel copy

2020-10-01 Thread Amit Kapila
On Tue, Sep 29, 2020 at 3:16 PM Greg Nancarrow  wrote:
>
> Hi Vignesh and Bharath,
>
> Seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as
> parallel-unsafe.
> Can you explain why this is?
>

I don't think we need to restrict this case and even if there is some
reason to do so then probably the same should be mentioned in the
comments.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-09-29 Thread vignesh C
On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila  wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila  wrote:
> >
> > Few additional comments:
> > ==
>
> Some more comments:
>

Thanks Amit for the comments, I will work on the comments and provide
a patch in the next few days.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-09-29 Thread Amit Kapila
On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila  wrote:
>
> Few additional comments:
> ==

Some more comments:

v5-0002-Framework-for-leader-worker-in-parallel-copy
===
1.
These values
+ * help in handover of multiple records with significant size of data to be
+ * processed by each of the workers to make sure there is no context
switch & the
+ * work is fairly distributed among the workers.

How about writing it as: "These values help in the handover of
multiple records with the significant size of data to be processed by
each of the workers. This also ensures there is no context switch and
the work is fairly distributed among the workers."

2. Can we keep WORKER_CHUNK_COUNT, MAX_BLOCKS_COUNT, and RINGSIZE as
power-of-two? Say WORKER_CHUNK_COUNT as 64, MAX_BLOCK_COUNT as 1024,
and accordingly choose RINGSIZE. At many places, we do that way. I
think it can sometimes help in faster processing due to cache size
requirements and in this case, I don't see a reason why we can't
choose these values to be power-of-two. If you agree with this change
then also do some performance testing after this change?
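
For instance, something along these lines (a sketch; the exact RINGSIZE
value is open, it just needs to remain a multiple of WORKER_CHUNK_COUNT):

#define WORKER_CHUNK_COUNT  64          /* power of two */
#define MAX_BLOCKS_COUNT    1024        /* power of two */
#define RINGSIZE            (16 * 1024) /* power of two, multiple of
                                         * WORKER_CHUNK_COUNT */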

3.
+ bool   curr_blk_completed;
+ char   data[DATA_BLOCK_SIZE]; /* data read from file */
+ uint8  skip_bytes;
+} ParallelCopyDataBlock;

Is there a reason to keep skip_bytes after data? Normally the variable
size data is at the end of the structure. Also, there is no comment
explaining the purpose of skip_bytes.
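
For example, something along these lines (a sketch; I am assuming from its
name that skip_bytes records carry-over bytes of a record spilling into the
next block):

typedef struct ParallelCopyDataBlock
{
	bool	curr_blk_completed;
	uint8	skip_bytes;				/* assumed: bytes of a record spilling
									 * over into the next block */
	char	data[DATA_BLOCK_SIZE];	/* data read from file; kept last */
} ParallelCopyDataBlock;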

4.
+ * Copy data block information.
+ * ParallelCopyDataBlock's will be created in DSM. Data read from file will be
+ * copied in these DSM data blocks. The leader process identifies the records
+ * and the record information will be shared to the workers. The workers will
+ * insert the records into the table. There can be one or more number
of records
+ * in each of the data block based on the record size.
+ */
+typedef struct ParallelCopyDataBlock

Keep one empty line after the description line like below. I also
suggested to do a minor tweak in the above sentence which is as
follows:

* Copy data block information.
*
* These data blocks are created in DSM. Data read ...

Try to follow a similar format in other comments as well.

5. I think it is better to move parallelism related code to a new file
(we can name it as copyParallel.c or something like that).

6. copy.c(1648,25): warning C4133: 'function': incompatible types -
from 'ParallelCopyLineState *' to 'uint32 *'
Getting above compilation warning on Windows.

v5-0003-Allow-copy-from-command-to-process-data-from-file
==
1.
@@ -4294,7 +5047,7 @@ BeginCopyFrom(ParseState *pstate,
  * only in text mode.
  */
  initStringInfo(&cstate->attribute_buf);
- cstate->raw_buf = (char *) palloc(RAW_BUF_SIZE + 1);
+ cstate->raw_buf = (IsParallelCopy()) ? NULL : (char *)
palloc(RAW_BUF_SIZE + 1);

Is there anyway IsParallelCopy can be true by this time? AFAICS, we do
anything about parallelism after this. If you want to save this
allocation then we need to move this after we determine that
parallelism can be used or not and accordingly the below code in the
patch needs to be changed.

 * ParallelCopyFrom - parallel copy leader's functionality.
  *
  * Leader executes the before statement for before statement trigger, if before
@@ -1110,8 +1547,302 @@ ParallelCopyFrom(CopyState cstate)
  ParallelCopyShmInfo *pcshared_info = cstate->pcdata->pcshared_info;
  ereport(DEBUG1, (errmsg("Running parallel copy leader")));

+ /* raw_buf is not used in parallel copy, instead data blocks are used.*/
+ pfree(cstate->raw_buf);
+ cstate->raw_buf = NULL;

Is there anything else also the allocation of which depends on parallelism?
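
In other words, something like this after the decision point (a sketch):

	/* Allocate raw_buf only once we know parallelism won't be used. */
	if (!IsParallelCopy())
		cstate->raw_buf = (char *) palloc(RAW_BUF_SIZE + 1);
	else
		cstate->raw_buf = NULL;	/* parallel copy reads via DSM data blocks */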

2.
+static pg_attribute_always_inline bool
+IsParallelCopyAllowed(CopyState cstate)
+{
+ /* Parallel copy not allowed for frontend (2.0 protocol) & binary option. */
+ if ((cstate->copy_dest == COPY_OLD_FE) || cstate->binary)
+ return false;
+
+ /* Check if copy is into foreign table or temporary table. */
+ if (cstate->rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE ||
+ RelationUsesLocalBuffers(cstate->rel))
+ return false;
+
+ /* Check if trigger function is parallel safe. */
+ if (cstate->rel->trigdesc != NULL &&
+ !IsTriggerFunctionParallelSafe(cstate->rel->trigdesc))
+ return false;
+
+ /*
+ * Check if there is after statement or instead of trigger or transition
+ * table triggers.
+ */
+ if (cstate->rel->trigdesc != NULL &&
+ (cstate->rel->trigdesc->trig_insert_after_statement ||
+ cstate->rel->trigdesc->trig_insert_instead_row ||
+ cstate->rel->trigdesc->trig_insert_new_table))
+ return false;
+
+ /* Check if the volatile expressions are parallel safe, if present any. */
+ if (!CheckExprParallelSafety(cstate))
+ return false;
+
+ /* Check if the insertion mode is single. */
+ if (FindInsertMethod(cstate) 

Re: Parallel copy

2020-09-29 Thread Greg Nancarrow
Hi Vignesh and Bharath,

Seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as
parallel-unsafe.
Can you explain why this is?

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Parallel copy

2020-09-28 Thread Ashutosh Sharma
On Mon, Sep 28, 2020 at 3:01 PM Amit Kapila  wrote:
>
> On Tue, Sep 22, 2020 at 2:44 PM vignesh C  wrote:
> >
> > Thanks Ashutosh for your comments.
> >
> > On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma  
> > wrote:
> > >
> > > Hi Vignesh,
> > >
> > > I've spent some time today looking at your new set of patches and I've
> > > some thoughts and queries which I would like to put here:
> > >
> > > Why are these not part of the shared cstate structure?
> > >
> > > SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, 
> > > cstate->null_print);
> > > SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
> > > SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
> > > SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);
> > >
> >
> > I have used shared_cstate mainly to share the integer & bool data
> > types from the leader to worker process. The above data types are of
> > char* data type, I will not be able to use it like how I could do it
> > for integer type. So I preferred to send these as separate keys to the
> > worker. Thoughts?
> >
>
> I think the way you have written will work but if we go with
> Ashutosh's proposal it will look elegant and in the future, if we need
> to share more strings as part of cstate structure then that would be
> easier. You can probably refer to EstimateParamListSpace,
> SerializeParamList, and RestoreParamList to see how we can share
> different types of data in one key.
>

Yeah. And in addition to that it will also reduce the number of DSM
keys that we need to maintain.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: Parallel copy

2020-09-28 Thread Amit Kapila
On Tue, Sep 22, 2020 at 2:44 PM vignesh C  wrote:
>
> Thanks Ashutosh for your comments.
>
> On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma  wrote:
> >
> > Hi Vignesh,
> >
> > I've spent some time today looking at your new set of patches and I've
> > some thoughts and queries which I would like to put here:
> >
> > Why are these not part of the shared cstate structure?
> >
> > SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print);
> > SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
> > SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
> > SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);
> >
>
> I have used shared_cstate mainly to share the integer & bool data
> types from the leader to worker process. The above data types are of
> char* data type, I will not be able to use it like how I could do it
> for integer type. So I preferred to send these as separate keys to the
> worker. Thoughts?
>

I think the way you have written will work but if we go with
Ashutosh's proposal it will look elegant and in the future, if we need
to share more strings as part of cstate structure then that would be
easier. You can probably refer to EstimateParamListSpace,
SerializeParamList, and RestoreParamList to see how we can share
different types of data in one key.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-09-28 Thread Amit Kapila
On Wed, Jul 22, 2020 at 7:48 PM vignesh C  wrote:
>
> On Tue, Jul 21, 2020 at 3:54 PM Amit Kapila  wrote:
> >
>
> > Review comments:
> > ===
> >
> > 0001-Copy-code-readjustment-to-support-parallel-copy
> > 1.
> > @@ -807,8 +835,11 @@ CopyLoadRawBuf(CopyState cstate)
> >   else
> >   nbytes = 0; /* no data need be saved */
> >
> > + if (cstate->copy_dest == COPY_NEW_FE)
> > + minread = RAW_BUF_SIZE - nbytes;
> > +
> >   inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
> > -   1, RAW_BUF_SIZE - nbytes);
> > +   minread, RAW_BUF_SIZE - nbytes);
> >
> > No comment to explain why this change is done?
> >
> > 0002-Framework-for-leader-worker-in-parallel-copy
>
> Currently CopyGetData copies a lesser amount of data to the buffer even though 
> space is available in the buffer, because minread was passed as 1 to CopyGetData. 
> Because of this there are frequent calls to CopyGetData for fetching the data. 
> In this case it will load only some data due to the below check:
> while (maxread > 0 && bytesread < minread && !cstate->reached_eof)
> After reading some data, bytesread will be greater than minread (which was 
> passed as 1) and it returns with a lesser amount of data, even though there is 
> still some space.
> This change is required for parallel copy feature as each time we get a new 
> DSM data block which is of 64K size and copy the data. If we copy less data 
> into DSM data blocks we might end up consuming all the DSM data blocks.
>

Why can't we reuse the DSM block which has unfilled space?

>  I felt this issue can be fixed as part of HEAD. Have posted a separate 
> thread [1] for this. I'm planning to remove that change once it gets 
> committed. Can that go as a separate
> patch or should we include it here?
> [1] - 
> https://www.postgresql.org/message-id/CALDaNm0v4CjmvSnftYnx_9pOS_dKRG%3DO3NnBgJsQmi0KipvLog%40mail.gmail.com
>

I am convinced by the reason given by Kyotaro-San in that another
thread [1] and performance data shown by Peter that this can't be an
independent improvement and rather in some cases it can do harm. Now,
if you need it for a parallel-copy path then we can change it
specifically to the parallel-copy code path but I don't understand
your reason completely.

> > 2.
..
> > + */
> > +typedef struct ParallelCopyLineBoundary
> >
> > Are we doing all this state management to avoid using locks while
> > processing lines?  If so, I think we can use either spinlock or LWLock
> > to keep the main patch simple and then provide a later patch to make
> > it lock-less.  This will allow us to first focus on the main design of
> > the patch rather than trying to make this datastructure processing
> > lock-less in the best possible way.
> >
>
> The steps will be more or less the same if we use a spinlock too. Step 1, step 3 & 
> step 4 will be common; we would have to use lock & unlock instead of step 2 & 
> step 5. I feel we can retain the current implementation.
>

I'll study this in detail and let you know my opinion on the same but
in the meantime, I don't follow one part of this comment: "If they
don't follow this order the worker might process wrong line_size and
leader might populate the information which worker has not yet
processed or in the process of processing."

Do you want to say that leader might overwrite some information which
worker hasn't read yet? If so, it is not clear from the comment.
Another minor point about this comment:

+ * ParallelCopyLineBoundary is common data structure between leader & worker,
+ * Leader process will be populating data block, data block offset &
the size of

I think there should be a full-stop after worker instead of a comma.

>
> > 6.
> > In function BeginParallelCopy(), you need to keep a provision to
> > collect wal_usage and buf_usage stats.  See _bt_begin_parallel for
> > reference.  Those will be required for pg_stat_statements.
> >
>
> Fixed
>

How did you ensure that this is fixed? Have you tested it, if so
please share the test? I see a basic problem with your fix.

+ /* Report WAL/buffer usage during parallel execution */
+ bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
+ walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+   &walusage[ParallelWorkerNumber]);

You need to call InstrStartParallelQuery() before the actual operation
starts, without that the stats won't be accurate. Also, after calling
WaitForParallelWorkersToFinish(), you need to accumulate the stats
collected from the workers, which neither you have done nor is possible
with the current code in your patch because you haven't made any
provision to capture them.
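
For reference, the usual pattern is roughly what _bt_begin_parallel and
_bt_parallel_build_main do (the bufferusage/walusage arrays here are
assumed to be the shm_toc_lookup results above):

	/* In each worker, bracketing the actual load work: */
	InstrStartParallelQuery();
	/* ... worker performs its share of the copy ... */
	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
						  &walusage[ParallelWorkerNumber]);

	/* In the leader, after WaitForParallelWorkersToFinish(): */
	for (int i = 0; i < pcxt->nworkers_launched; i++)
		InstrAccumParallelQuery(&bufferusage[i], &walusage[i]);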

Re: Parallel copy

2020-09-24 Thread Ashutosh Sharma
On Thu, Sep 24, 2020 at 3:00 PM Bharath Rupireddy
 wrote:
>
> >
> > > Have you tested your patch when encoding conversion is needed? If so,
> > > could you please point out the email that has the test results.
> > >
> >
> > We have not yet done encoding testing, we will do and post the results
> > separately in the coming days.
> >
>
> Hi Ashutosh,
>
> I ran the tests ensuring pg_server_to_any() gets called from copy.c. I 
> specified the encoding option of COPY command, with client and server 
> encodings being UTF-8.
>

Thanks Bharath for the testing. The results look impressive.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: Parallel copy

2020-09-24 Thread Bharath Rupireddy
>
> > Have you tested your patch when encoding conversion is needed? If so,
> > could you please point out the email that has the test results.
> >
>
> We have not yet done encoding testing, we will do and post the results
> separately in the coming days.
>

Hi Ashutosh,

I ran the tests ensuring pg_server_to_any() gets called from copy.c. I
specified the encoding option of COPY command, with client and server
encodings being UTF-8.

Tests were performed with a custom postgresql.conf[1], 10 million rows, 5.2GB
of data. The results are triplets of the form (exec time in sec, number of
workers, gain):

Use case 1: 2 indexes on integer columns, 1 index on text column
(1174.395, 0, 1X), (1127.792, 1, 1.04X), (644.260, 2, 1.82X), (341.284, 4,
3.43X), (204.423, 8, 5.74X), (140.692, 16, 8.34X), (129.843, 20, 9.04X),
(134.511, 30, 8.72X)

Use case 2: 1 gist index on text column
(811.412, 0, 1X), (772.203, 1, 1.05X), (437.364, 2, 1.85X), (263.575, 4,
3.08X), (175.135, 8, 4.63X), (155.355, 16, 5.22X), (178.704, 20, 4.54X),
(199.402, 30, 4.06)

Use case 3: 3 indexes on integer columns
(220.680, 0, 1X), (185.096, 1, 1.19X), (134.811, 2, 1.64X), (114.585, 4,
1.92X), (107.707, 8, 2.05X), (101.253, 16, 2.18X), (100.749, 20, 2.19X),
(100.656, 30, 2.19X)

The results are similar to our earlier runs[2].

[1]
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

[2]
https://www.postgresql.org/message-id/CALDaNm13zK%3DJXfZWqZJsm3%2B2yagYDJc%3DeJBgE4i77-4PPNj7vw%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-09-24 Thread Bharath Rupireddy
Thanks Greg for the testing.

On Thu, Sep 24, 2020 at 8:27 AM Greg Nancarrow  wrote:
>
> > 3. Could you please run the test case 3 times at least? Just to ensure
the consistency of the issue.
>
> Yes, have run 4 times. Seems to be a performance hit (whether normal
> copy or parallel-1 copy) on the first COPY run on a freshly created
> database. After that, results are consistent.
>

From the logs, I see that it is happening only with the default
postgresql.conf, and there's inconsistency in the table insertion times,
especially from the 1st run to the 2nd. The variation in table insertion
time is also larger. This is expected with the default postgresql.conf
because of interference from the background processes. That's the reason we
usually run with a custom configuration to correctly measure the performance
gain.

br_default_0_1.log:
2020-09-23 22:32:36.944 JST [112616] LOG:  totaltableinsertiontime =
155068.244 ms
2020-09-23 22:33:57.615 JST [11426] LOG:  totaltableinsertiontime =
42096.275 ms
2020-09-23 22:37:39.192 JST [43097] LOG:  totaltableinsertiontime =
29135.262 ms
2020-09-23 22:38:56.389 JST [54205] LOG:  totaltableinsertiontime =
38953.912 ms
2020-09-23 22:40:27.573 JST [66485] LOG:  totaltableinsertiontime =
27895.326 ms
2020-09-23 22:41:34.948 JST [77523] LOG:  totaltableinsertiontime =
28929.642 ms
2020-09-23 22:43:18.938 JST [89857] LOG:  totaltableinsertiontime =
30625.015 ms
2020-09-23 22:44:21.938 JST [101372] LOG:  totaltableinsertiontime =
24624.045 ms

br_default_1_0.log:
2020-09-24 11:12:14.989 JST [56146] LOG:  totaltableinsertiontime =
192068.350 ms
2020-09-24 11:13:38.228 JST [88455] LOG:  totaltableinsertiontime =
30999.942 ms
2020-09-24 11:15:50.381 JST [108935] LOG:  totaltableinsertiontime =
31673.204 ms
2020-09-24 11:17:14.260 JST [118541] LOG:  totaltableinsertiontime =
31367.027 ms
2020-09-24 11:20:18.975 JST [17270] LOG:  totaltableinsertiontime =
26858.924 ms
2020-09-24 11:22:17.822 JST [26852] LOG:  totaltableinsertiontime =
66531.442 ms
2020-09-24 11:24:09.221 JST [47971] LOG:  totaltableinsertiontime =
38943.384 ms
2020-09-24 11:25:30.955 JST [58849] LOG:  totaltableinsertiontime =
28286.634 ms

br_custom_0_1.log:
2020-09-24 10:29:44.956 JST [110477] LOG:  totaltableinsertiontime =
20207.928 ms
2020-09-24 10:30:49.570 JST [120568] LOG:  totaltableinsertiontime =
23360.006 ms
2020-09-24 10:32:31.659 JST [2753] LOG:  totaltableinsertiontime =
19837.588 ms
2020-09-24 10:35:49.245 JST [31118] LOG:  totaltableinsertiontime =
21759.253 ms
2020-09-24 10:36:54.834 JST [41763] LOG:  totaltableinsertiontime =
23547.323 ms
2020-09-24 10:38:53.507 JST [56779] LOG:  totaltableinsertiontime =
21543.984 ms
2020-09-24 10:39:58.713 JST [67489] LOG:  totaltableinsertiontime =
25254.563 ms

br_custom_1_0.log:
2020-09-24 10:49:03.242 JST [15308] LOG:  totaltableinsertiontime =
16541.201 ms
2020-09-24 10:50:11.848 JST [23324] LOG:  totaltableinsertiontime =
15076.577 ms
2020-09-24 10:51:24.497 JST [35394] LOG:  totaltableinsertiontime =
16400.777 ms
2020-09-24 10:52:32.354 JST [42953] LOG:  totaltableinsertiontime =
15591.051 ms
2020-09-24 10:54:30.327 JST [61136] LOG:  totaltableinsertiontime =
16700.954 ms
2020-09-24 10:55:38.377 JST [68719] LOG:  totaltableinsertiontime =
15435.150 ms
2020-09-24 10:57:08.927 JST [83335] LOG:  totaltableinsertiontime =
17133.251 ms
2020-09-24 10:58:17.420 JST [90905] LOG:  totaltableinsertiontime =
15352.753 ms

>
> Test results show that Parallel COPY with 1 worker is performing
> better than normal COPY in the test scenarios run.
>

Good to know :)

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-09-23 Thread Greg Nancarrow
Hi Bharath,

> Few things to follow before testing:
> 1. Is the table being dropped/truncated after the test with 0 workers and 
> before running with 1 worker? If not, then the index insertion time would 
> increase[1] (for me it increases by 10 sec). This is expected because the 
> first time the index is built bottom-up (from leaves to root), but the 
> second time it has to search for and insert at the proper leaf and inner 
> B+Tree nodes.

Yes, it' being truncated before running each and every COPY.

> 2. If possible, can you also run with custom postgresql.conf settings[2] 
> along with default? Just to ensure that other bg processes such as 
> checkpointer, autovacuum, bgwriter etc. don't affect our testcase. For 
> instance, with default postgresql.conf file, it looks like checkpointing[3] 
> is happening frequently, could you please let us know if that happens at your 
> end?

Yes, have run with default and your custom settings. With default
settings, I can confirm that checkpointing is happening frequently
with the tests I've run here.

> 3. Could you please run the test case 3 times at least? Just to ensure the 
> consistency of the issue.

Yes, have run 4 times. Seems to be a performance hit (whether normal
copy or parallel-1 copy) on the first COPY run on a freshly created
database. After that, results are consistent.

> 4. I ran the tests in a performance test system where no other user 
> processes(except system processes) are running. Is it possible for you to do 
> the same?
>
> Please capture and share the timing logs with us.
>

Yes, I have ensured the system is as idle as possible prior to testing.

I have attached the test results obtained after building with your
Parallel Copy patch and testing patch applied (HEAD at
733fa9aa51c526582f100aa0d375e0eb9a6bce8b).

Test results show that Parallel COPY with 1 worker is performing
better than normal COPY in the test scenarios run. There is a
performance hit (regardless of COPY type) on the very first COPY run
on a freshly-created database.

I ran the test case 4 times. and also in reverse order, with truncate
run before each COPY (output and logs named _0_1 run normal COPY
then parallel COPY, and named _1_0 run parallel COPY and then
normal COPY).

Please refer to attached results.

Regards,
Greg


testing_patch_results.tar.gz
Description: application/gzip


Re: Parallel copy

2020-09-22 Thread Bharath Rupireddy
On Thu, Sep 17, 2020 at 11:06 AM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:
>
> On Wed, Sep 16, 2020 at 1:20 PM Greg Nancarrow 
wrote:
> >
> > Fortunately I have been given permission to share the exact table
> > definition and data I used, so you can check the behaviour and timings
> > on your own test machine.
> >
>
> Thanks Greg for the script. I ran your test case and I didn't observe
> any increase in exec time with 1 worker, indeed, we have benefitted a
> few seconds from 0 to 1 worker as expected.
>
> Execution time is in seconds. Each test case is executed 3 times on
> release build. Each time the data directory is recreated.
>
> Case 1: 100 rows, 2GB
> With Patch, default configuration, 0 worker: 88.933, 92.261, 88.423
> With Patch, default configuration, 1 worker: 73.825, 74.583, 72.678
>
> With Patch, custom configuration, 0 worker: 76.191, 78.160, 78.822
> With Patch, custom configuration, 1 worker: 61.289, 61.288, 60.573
>
> Case 2: 255 rows, 5GB
> With Patch, default configuration, 0 worker: 246.031, 188.323, 216.683
> With Patch, default configuration, 1 worker: 156.299, 153.293, 170.307
>
> With Patch, custom configuration, 0 worker: 197.234, 195.866, 196.049
> With Patch, custom configuration, 1 worker: 157.173, 158.287, 157.090
>

Hi Greg,

If you still observe the issue in your testing environment, I'm attaching a
testing patch (applies on top of the latest parallel copy patch set,
i.e. v5 patches 1 to 6) to capture various timings such as total copy time
in the leader and workers, index and table insertion time, and leader and
worker waiting time. These timings are written to the server log file.

Few things to follow before testing:
1. Is the table being dropped/truncated after the test with 0 workers and
before running with 1 worker? If not, then the index insertion time would
increase[1] (for me it increases by 10 sec). This is expected because the
first time the index is built bottom-up (from leaves to root), but the
second time it has to search for and insert at the proper leaf and inner
B+Tree nodes.
2. If possible, can you also run with custom postgresql.conf settings[2]
along with default? Just to ensure that other bg processes such as
checkpointer, autovacuum, bgwriter etc. don't affect our testcase. For
instance, with default postgresql.conf file, it looks like checkpointing[3]
is happening frequently, could you please let us know if that happens at
your end?
3. Could you please run the test case 3 times at least? Just to ensure the
consistency of the issue.
4. I ran the tests in a performance test system where no other user
processes(except system processes) are running. Is it possible for you to
do the same?

Please capture and share the timing logs with us.

Here's a snapshot of how the added timings show up in the logs (I captured
this with your test case 1: 100 rows, 2GB, custom postgresql.conf
settings[2]).
with 0 workers:
2020-09-22 10:49:27.508 BST [163910] LOG:  totaltableinsertiontime =
24072.034 ms
2020-09-22 10:49:27.508 BST [163910] LOG:  totalindexinsertiontime = 60.682
ms
2020-09-22 10:49:27.508 BST [163910] LOG:  totalcopytime = 59664.594 ms

with 1 worker:
2020-09-22 10:53:58.409 BST [163947] LOG:  totalcopyworkerwaitingtime =
59.815 ms
2020-09-22 10:53:58.409 BST [163947] LOG:  totaltableinsertiontime =
23585.881 ms
2020-09-22 10:53:58.409 BST [163947] LOG:  totalindexinsertiontime = 30.946
ms
2020-09-22 10:53:58.409 BST [163947] LOG:  totalcopytimeworker = 47047.956
ms
2020-09-22 10:53:58.429 BST [163946] LOG:  totalcopyleaderwaitingtime =
26746.744 ms
2020-09-22 10:53:58.429 BST [163946] LOG:  totalcopytime = 47150.002 ms

[1]
0 worker:
LOG:  totaltableinsertiontime = 25491.881 ms
LOG:  totalindexinsertiontime = 14136.104 ms
LOG:  totalcopytime = 75606.858 ms
table is not dropped and so are indexes
1 worker:
LOG:  totalcopyworkerwaitingtime = 64.582 ms
LOG:  totaltableinsertiontime = 21360.875 ms
LOG:  totalindexinsertiontime = 24843.570 ms
LOG:  totalcopytimeworker = 69837.162 ms
LOG:  totalcopyleaderwaitingtime = 49548.441 ms
LOG:  totalcopytime = 69997.778 ms

[2]
custom postgresql.conf configuration:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

[3]
LOG:  checkpoints are occurring too frequently (14 seconds apart)
HINT:  Consider increasing the configuration parameter "max_wal_size".

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v1-0001-Parallel-Copy-Exec-Time-Capture.patch
Description: Binary data


Re: Parallel copy

2020-09-16 Thread Bharath Rupireddy
On Wed, Sep 16, 2020 at 1:20 PM Greg Nancarrow  wrote:
>
> Fortunately I have been given permission to share the exact table
> definition and data I used, so you can check the behaviour and timings
> on your own test machine.
>

Thanks Greg for the script. I ran your test case and I didn't observe
any increase in exec time with 1 worker, indeed, we have benefitted a
few seconds from 0 to 1 worker as expected.

Execution time is in seconds. Each test case is executed 3 times on
release build. Each time the data directory is recreated.

Case 1: 100 rows, 2GB
With Patch, default configuration, 0 worker: 88.933, 92.261, 88.423
With Patch, default configuration, 1 worker: 73.825, 74.583, 72.678

With Patch, custom configuration, 0 worker: 76.191, 78.160, 78.822
With Patch, custom configuration, 1 worker: 61.289, 61.288, 60.573

Case 2: 255 rows, 5GB
With Patch, default configuration, 0 worker: 246.031, 188.323, 216.683
With Patch, default configuration, 1 worker: 156.299, 153.293, 170.307

With Patch, custom configuration, 0 worker: 197.234, 195.866, 196.049
With Patch, custom configuration, 1 worker: 157.173, 158.287, 157.090

[1] - Custom configuration is set up to ensure that no other processes
influence the results. The postgresql.conf used:
shared_buffers = 40GB
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-09-16 Thread Ashutosh Sharma
> > > This is
> > >explained in detail above ParallelCopyLineBoundary.
> >
> > Yes, but prior to that call to pg_atomic_compare_exchange_u32(),
> > aren't you separately reading line_state and line_size, so that
> > between those reads, it may have transitioned from leader to another
> > worker, such that the read line state ("cur_line_state", being checked
> > in the if block) may not actually match what is now in the line_state
> > and/or the read line_size ("dataSize") doesn't actually correspond to
> > the read line state?
> >
> > (sorry, still not 100% convinced that the synchronization and checks
> > are safe in all cases)
> >
>
> I think that you are describing the problem that could happen in the
> following case:
> when we read cur_line_state, the value was LINE_WORKER_PROCESSED or
> LINE_WORKER_PROCESSING. Then, in some cases, if the leader is very fast
> compared to the workers, the leader quickly populates one line and
> sets the state to LINE_LEADER_POPULATED, i.e. the state changes to
> LINE_LEADER_POPULATED while we are checking cur_line_state.
> I feel this will not be a problem because the leader will populate & wait
> till some RING element is available to populate. In the meantime the
> worker has seen that the state is LINE_WORKER_PROCESSED or
> LINE_WORKER_PROCESSING (the previous state that it read), so the worker
> has identified that this chunk was processed by some other worker; the
> worker will move on and try to get the next available chunk & insert
> those records. It will keep continuing till it gets the next chunk to
> process. Eventually one of the workers will get this chunk and process
> it.
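
For reference, the claiming step described above boils down to something
like this minimal sketch (assumed names; line_state is presumably a
pg_atomic_uint32 inside ParallelCopyLineBoundary). Only one caller can win
the LINE_LEADER_POPULATED -> LINE_WORKER_PROCESSING transition:

	uint32		expected = LINE_LEADER_POPULATED;

	if (pg_atomic_compare_exchange_u32(&lineInfo->line_state, &expected,
									   LINE_WORKER_PROCESSING))
	{
		/* We own this line now; line_size is stable and safe to read. */
	}
	else
	{
		/* Another worker won the race; move on to the next chunk. */
	}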
>
> > (3) v3-0006-Parallel-Copy-For-Binary-Format-Files.patch
> >
> > >raw_buf is not used in parallel copy, instead raw_buf will be pointing
> > >to shared memory data blocks. This memory was allocated as part of
> > >BeginCopyFrom, uptil this point we cannot be 100% sure as copy can be
> > >performed sequentially like in case max_worker_processes is not
> > >available, if it switches to sequential mode raw_buf will be used
> > >while performing copy operation. At this place we can safely free this
> > >memory that was allocated
> >
> > So the following code (which checks raw_buf, which still points to
> > memory that has been pfreed) is still valid?
> >
> >   In the SetRawBufForLoad() function, which is called by CopyReadLineText():
> >
> > cur_data_blk_ptr = (cstate->raw_buf) ?
> > &pcshared_info->data_blocks[cur_block_pos] : NULL;
> >
> > The above code looks a bit dicey to me. I stepped over that line in
> > the debugger when I debugged an instance of Parallel Copy, so it
> > definitely gets executed.
> > It makes me wonder what other code could possibly be checking raw_buf
> > and using it in some way, when in fact what it points to has been
> > pfreed.
> >
> > Are you able to add the following line of code, or will it (somehow)
> > break logic that you are relying on?
> >
> > pfree(cstate->raw_buf);
> > cstate->raw_buf = NULL;   <=== I suggest that this line is added
> >
>
> You are right, I have debugged & verified it sets it to an invalid
> block which is not expected. There are chances this would have caused
> some corruption in some machines. The suggested fix is required, I
> have fixed it. I have moved this change to
> 0003-Allow-copy-from-command-to-process-data-from-file.patch as
> 0006-Parallel-Copy-For-Binary-Format-Files is only for Binary format
> parallel copy & that change is common change for parallel copy.
>
> I have attached new set of patches with the fixes.
> Thoughts?
>
> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-09-16 Thread Greg Nancarrow
Hi Bharath,

On Tue, Sep 15, 2020 at 11:49 PM Bharath Rupireddy
 wrote:
>
> Few questions:
>  1. Was the run performed with default postgresql.conf file? If not,
> what are the changed configurations?
Yes, just default settings.

>  2. Are the readings for normal copy(190.891sec, mentioned by you
> above) taken on HEAD or with patch, 0 workers?
With patch

>How much is the runtime
> with your test case on HEAD(Without patch) and 0 workers(With patch)?
TBH, I didn't test that. Looking at the changes, I wouldn't expect a
degradation of performance for normal copy (you have tested, right?).

>  3. Was the run performed on release build?
For generating the perf data I sent (normal copy vs parallel copy with
1 worker), I used a debug build (-g -O0), as that is needed for
generating all the relevant perf data for Postgres code. Previously I
ran with a release build (-O2).

>  4. Were the readings taken on multiple runs(say 3 or 4 times)?
The readings I sent were from just one run (not averaged), but I did
run the tests several times to verify the readings were representative
of the pattern I was seeing.


Fortunately I have been given permission to share the exact table
definition and data I used, so you can check the behaviour and timings
on your own test machine.
Please see the attachment.
You can create the table using the table.sql and index_4.sql
definitions in the "sql" directory.
The data.csv file (to be loaded by COPY) can be created with the
included "dupdata" tool in the "input" directory, which you need to
build, then run, specifying a suitable number of records and path of
the template record (see README). Obviously the larger the number of
records, the larger the file ...
The table can then be loaded using COPY with "format csv" (and
"parallel N" if testing parallel copy).

Regards,
Greg Nancarrow
Fujitsu Australia


Re: Parallel copy

2020-09-15 Thread Bharath Rupireddy
On Fri, Sep 11, 2020 at 3:49 AM Greg Nancarrow  wrote:
>
> I couldn't use the original machine from which I obtained the previous
> results, but ended up using a 4-core CentOS7 VM, which showed a
> similar pattern in the performance results for this test case.
> I obtained the following results from loading a 2GB CSV file (100
> rows, 4 indexes):
>
> Copy Type   Duration (s)   Load factor
> =======================================
> Normal Copy 190.891        -
>
> Parallel Copy
> (#workers)
> 1           210.947        0.90
>
Hi Greg,

I tried to recreate the test case (attached) and I didn't find much
difference with the custom postgresql.conf file.
Test case: 25 tuples, 4 indexes (composite indexes with 10
columns), 3.7GB, 100 columns (as suggested by you, with all the
varchar(255) columns holding 255 characters), exec time in sec.

With custom postgresql.conf[1], removed and recreated the data
directory after every run(I couldn't perform the OS page cache flush
due to some reasons. So, chose this recreation of data dir way, for
testing purpose):
 HEAD: 129.547, 128.624, 128.890
 Patch: 0 workers - 130.213, 131.298, 130.555
 Patch: 1 worker - 127.757, 125.560, 128.275

With default postgresql.conf, removed and recreated the data directory
after every run:
 HEAD: 138.276, 150.472, 153.304
 Patch: 0 workers - 162.468,  149.423, 159.137
 Patch: 1 worker - 136.055, 144.250, 137.916

Few questions:
 1. Was the run performed with default postgresql.conf file? If not,
what are the changed configurations?
 2. Are the readings for normal copy(190.891sec, mentioned by you
above) taken on HEAD or with patch, 0 workers? How much is the runtime
with your test case on HEAD(Without patch) and 0 workers(With patch)?
 3. Was the run performed on release build?
 4. Were the readings taken on multiple runs(say 3 or 4 times)?

[1] - Postgres configuration used for above testing:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
 DROP TABLE IF EXISTS t100_1;
 CREATE TABLE t100_1 (
 c1_s_1 bigserial,
 c2_vc_1 varchar(255),
 c3_d_1 date,
 c4_n_1 numeric,
 c5_vc_2 varchar(255),
 c6_d_2 date,
 c7_vc_3 varchar(255),
 c8_t_1 time without time zone,
 c9_vc_4 varchar(255),
 c10_vc_5 varchar(255),
 c11_t_2 time without time zone,
 c12_vc_6 varchar(255),
 c13_vc_7 varchar(255),
 c14_n_2 numeric,
 c15_d_3 date,
 c16_t_3 time without time zone,
 c17_n_3 numeric,
 c18_vc_8 varchar(255),
 c19_d_4 date,
 c20_vc_9 varchar(255),
 c21_t_4 time without time zone,
 c22_vc_10 varchar(255),
 c23_d_5 date,
 c24_vc_11 varchar(255),
 c25_t_5 time without time zone,
 c26_t_6 time without time zone,
 c27_vc_12 varchar(255),
 c28_vc_13 varchar(255),
 c29_d_6 date,
 c30_vc_14 varchar(255),
 c31_vc_15 varchar(255),
 c32_t_7 time without time zone,
 c33_t_8 time without time zone,
 c34_vc_16 varchar(255),
 c35_vc_17 varchar(255),
 c36_d_7 date,
 c37_d_8 date,
 c38_vc_18 varchar(255),
 c39_t_9 time without time zone,
 c40_vc_19 varchar(255),
 c41_vc_20 varchar(255),
 c42_vc_21 varchar(255),
 c43_vc_22 varchar(255),
 c44_t_10 time without time zone,
 c45_d_9 date,
 c46_vc_23 varchar(255),
 c47_t_11 time without time zone,
 c48_vc_24 varchar(255),
 c49_vc_25 varchar(255),
 c50_vc_26 varchar(255),
 c51_d_10 date,
 c52_vc_27 varchar(255),
 c53_vc_28 varchar(255),
 c54_vc_29 varchar(255),
 c55_vc_30 varchar(255),
 c56_d_11 date,
 c57_vc_31 varchar(255),
 c58_vc_32 varchar(255),
 c59_vc_33 varchar(255),
 c60_vc_34 varchar(255),
 c61_vc_35 varchar(255),
 c62_vc_36 varchar(255),
 c63_vc_37 varchar(255),
 c64_vc_38 varchar(255),
 c65_vc_39 varchar(255),
 c66_n_4 numeric,
 c67_vc_40 varchar(255),
 c68_vc_41 varchar(255),
 c69_vc_42 varchar(255),
 c70_vc_43 varchar(255),
 c71_n_5 numeric,
 c72_vc_44 varchar(255),
 c73_n_6 numeric,
 c74_vc_45 varchar(255),
 c75_vc_46 varchar(255),
 c76_vc_47 varchar(255),
 c77_n_7 numeric,
 c78_n_8 numeric,
 c79_d_12 date,
 c80_n_9 numeric,
 c81_vc_48 varchar(255),
 c82_vc_49 varchar(255),
 c83_vc_50 varchar(255),
 c84_n_10 numeric,
 c85_vc_51 varchar(255),
 c86_vc_52 varchar(255),
 c87_vc_53 varchar(255),
 c88_d_13 date,
 c89_n_13 numeric,
 c90_vc_54 varchar(255),
 c91_vc_55 varchar(255),
 c92_n_11 numeric

Re: Parallel copy

2020-09-07 Thread vignesh C
On Tue, Sep 1, 2020 at 3:39 PM Greg Nancarrow  wrote:
>
> Hi Vignesh,
>
> >Can you share with me the script you used to generate the data & the ddl of 
> >the table, so that it will help me check that >scenario you faced the 
> >>problem.
>
> Unfortunately I can't directly share it (considered company IP),
> though having said that it's only doing something that is relatively
> simple and unremarkable, so I'd expect it to be much like what you are
> currently doing. I can describe it in general.
>
> The table being used contains 100 columns (as I pointed out earlier),
> with the first column of "bigserial" type, and the others of different
> types like "character varying(255)", "numeric", "date" and "time
> without timezone". There's about 60 of the "character varying(255)"
> overall, with the other types interspersed.
>

Thanks Greg for executing & sharing the results.
I tried with a similar test case that you suggested, I was not able to
reproduce the degradation scenario.
If it is possible, can you run perf for the scenario with 1 worker &
non parallel mode & share the perf results, we will be able to find
out which of the functions is consuming more time by doing a
comparison of the perf reports.
Steps for running perf:
1) get the postgres pid
2) perf record -a -g -p <pid>
3) Run copy command
4) Execute "perf report -g" once copy finishes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-09-03 Thread Greg Nancarrow
>On Wed, Sep 2, 2020 at 3:40 PM vignesh C  wrote:
> I have attached the scripts that I used for the test results I
> mentioned in my previous mail. create.sql file has the table that I
> used, insert_data_gen.txt has the insert data generation scripts. I
> varied the count in insert_data_gen to generate csv files of 1GB, 2GB
> & 5GB & varied the data to generate 1 char, 10 char & 100 char for
> each column for various testing. You can rename insert_data_gen.txt to
> insert_data_gen.sh & generate the csv file.


Hi Vignesh,

I used your script and table definition, multiplying the number of
records to produce a 5GB and 9.5GB CSV file.
I got the following results:


(1) Postgres default settings, 5GB CSV (53 rows):

Copy Type   Duration (s)   Load factor
=======================================
Normal Copy 132.197        -

Parallel Copy
(#workers)
1           98.428         1.34
2           52.753         2.51
3           37.630         3.51
4           33.554         3.94
5           33.636         3.93
6           33.821         3.91
7           34.270         3.86
8           34.465         3.84
9           34.315         3.85
10          33.543         3.94


(2) Postgres increased resources, 5GB CSV (53 rows):

shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size=2GB


Copy Type   Duration (s)   Load factor
=======================================
Normal Copy 131.835        -

Parallel Copy
(#workers)
1           98.301         1.34
2           53.261         2.48
3           37.868         3.48
4           34.224         3.85
5           33.831         3.90
6           34.229         3.85
7           34.512         3.82
8           34.303         3.84
9           34.690         3.80
10          34.479         3.82



(3) Postgres default settings, 9.5GB CSV (100 rows):

Copy Type   Duration (s)   Load factor
=======================================
Normal Copy 248.503        -

Parallel Copy
(#workers)
1           185.724        1.34
2           99.832         2.49
3           70.560         3.52
4           63.328         3.92
5           63.182         3.93
6           64.108         3.88
7           64.131         3.87
8           64.350         3.86
9           64.293         3.87
10          63.818         3.89


(4) Postgres increased resources, 9.5GB CSV (100 rows):

shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size=2GB


Copy Type   Duration (s)   Load factor
=======================================
Normal Copy 248.647        -

Parallel Copy
(#workers)
1           182.236        1.36
2           92.814         2.68
3           67.347         3.69
4           63.839         3.89
5           62.672         3.97
6           63.873         3.89
7           64.930         3.83
8           63.885         3.89
9           62.397         3.98
10          64.477         3.86



So as you found, with this particular table definition and data, 1
parallel worker always performs better than normal copy.
The different result obtained for this particular case seems to be
caused by the following factors:
- different table definition (I used a variety of column types)
- amount of data per row (I used less data per row, so more rows per
same size data file)

As I previously observed, if the target table has no indexes,
increasing resources beyond the default settings makes little
difference to the performance.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Parallel copy

2020-09-01 Thread vignesh C
On Tue, Sep 1, 2020 at 3:39 PM Greg Nancarrow  wrote:
>
> Hi Vignesh,
>
> > Can you share with me the script you used to generate the data & the DDL
> > of the table, so that it will help me check the scenario where you faced
> > the problem.
>
> Unfortunately I can't directly share it (considered company IP),
> though having said that it's only doing something that is relatively
> simple and unremarkable, so I'd expect it to be much like what you are
> currently doing. I can describe it in general.
>
> The table being used contains 100 columns (as I pointed out earlier),
> with the first column of "bigserial" type, and the others of different
> types like "character varying(255)", "numeric", "date" and "time
> without timezone". There's about 60 of the "character varying(255)"
> overall, with the other types interspersed.
>
> When testing with indexes, 4 b-tree indexes were used that each
> included the first column and then distinctly 9 other columns.
>
> A CSV record (row) template file was created with test data
> (corresponding to the table), and that was simply copied and appended
> over and over with a record prefix in order to create the test data
> file.
> The following shell-script basically does it (but very slowly). I was
> using a small C program to do similar, a lot faster.
> In my case, N=255 produced about a 5GB CSV file.
>
> file_out=data.csv; for i in {1..N}; do echo -n "$i," >> $file_out;
> cat sample_record.csv >> $file_out; done
>
> One other thing I should mention is that between each test run, I
> cleared the OS page cache, as described here:
> https://linuxhint.com/clear_cache_linux/
> That way, each COPY FROM is not taking advantage of any OS-cached data
> from a previous COPY FROM.

I will try with a similar test and check if I can reproduce.

> If your data is somehow significantly different and you want to (and
> can) share your script, then I can try it in my environment.

I have attached the scripts that I used for the test results I
mentioned in my previous mail. create.sql file has the table that I
used, insert_data_gen.txt has the insert data generation scripts. I
varied the count in insert_data_gen to generate csv files of 1GB, 2GB
& 5GB & varied the data to generate 1 char, 10 char & 100 char for
each column for various testing. You can rename insert_data_gen.txt to
insert_data_gen.sh & generate the csv file.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
#!/bin/bash
for i in {1..10}
do
   echo 

Re: Parallel copy

2020-09-01 Thread Greg Nancarrow
Hi Vignesh,

> Can you share with me the script you used to generate the data & the DDL of
> the table, so that it will help me check the scenario where you faced the
> problem.

Unfortunately I can't directly share it (considered company IP),
though having said that it's only doing something that is relatively
simple and unremarkable, so I'd expect it to be much like what you are
currently doing. I can describe it in general.

The table being used contains 100 columns (as I pointed out earlier),
with the first column of "bigserial" type, and the others of different
types like "character varying(255)", "numeric", "date" and "time
without timezone". There's about 60 of the "character varying(255)"
overall, with the other types interspersed.

When testing with indexes, 4 b-tree indexes were used that each
included the first column and then distinctly 9 other columns.

A CSV record (row) template file was created with test data
(corresponding to the table), and that was simply copied and appended
over and over with a record prefix in order to create the test data
file.
The following shell-script basically does it (but very slowly). I was
using a small C program to do similar, a lot faster.
In my case, N=255 produced about a 5GB CSV file.

file_out=data.csv; for i in {1..N}; do echo -n "$i," >> $file_out;
cat sample_record.csv >> $file_out; done
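
The small C program I mentioned is essentially the same idea; a minimal
sketch of it (hypothetical, no error handling, and it assumes the template
record fits in the buffer) would be:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    long    nrecs = (argc > 1) ? atol(argv[1]) : 1000000;
    FILE   *in = fopen("sample_record.csv", "rb");
    FILE   *out = fopen("data.csv", "wb");
    char    tmpl[65536];
    size_t  len = fread(tmpl, 1, sizeof(tmpl), in);

    for (long i = 1; i <= nrecs; i++)
    {
        fprintf(out, "%ld,", i);    /* record prefix (serial column) */
        fwrite(tmpl, 1, len, out);  /* the rest of the template record */
    }
    fclose(in);
    fclose(out);
    return 0;
}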

One other thing I should mention is that between each test run, I
cleared the OS page cache, as described here:
https://linuxhint.com/clear_cache_linux/
That way, each COPY FROM is not taking advantage of any OS-cached data
from a previous COPY FROM.

If your data is somehow significantly different and you want to (and
can) share your script, then I can try it in my environment.


Regards,
Greg




Re: Parallel copy

2020-08-31 Thread vignesh C
On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow  wrote:
> - Parallel Copy with 1 worker ran slower than normal Copy in a couple
> of cases (I did question if allowing 1 worker was useful in my patch
> review).

Thanks Greg for your review & testing.
I had executed various tests with 1GB, 2GB & 5GB csv files with 100 columns,
without parallel mode & with 1 parallel worker. The test results are as
given below:
Test                               Without parallel mode   With 1 parallel worker
1GB csv file, 100 columns
(100 bytes data in each column)    62 seconds              47 seconds (1.32X)
1GB csv file, 100 columns
(1000 bytes data in each column)   89 seconds              78 seconds (1.14X)
2GB csv file, 100 columns
(1 byte data in each column)       277 seconds             256 seconds (1.08X)
5GB csv file, 100 columns
(100 byte data in each column)     515 seconds             445 seconds (1.16X)
I have run the tests multiple times and have noticed similar execution
times in all the runs for the above tests.
In the above results there is a slight improvement with 1 worker. In my
tests I did not observe the degradation for copy with 1 worker compared to
non-parallel copy. Can you share the script you used to generate the data &
the DDL of the table, so that it will help me check the scenario where you
faced the problem?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-08-27 Thread Amit Kapila
On Thu, Aug 27, 2020 at 4:56 PM vignesh C  wrote:
>
> On Thu, Aug 27, 2020 at 8:24 AM Amit Kapila  wrote:
> >
> > On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow  wrote:
> > >
> > > > I have attached new set of patches with the fixes.
> > > > Thoughts?
> > >
> > > Hi Vignesh,
> > >
> > > I don't really have any further comments on the code, but would like
> > > to share some results of some Parallel Copy performance tests I ran
> > > (attached).
> > >
> > > The tests loaded a 5GB CSV data file into a 100 column table (of
> > > different data types). The following were varied as part of the test:
> > > - Number of workers (1 – 10)
> > > - No indexes / 4-indexes
> > > - Default settings / increased resources (shared_buffers,work_mem, etc.)
> > >
> > > (I did not do any partition-related tests as I believe those type of
> > > tests were previously performed)
> > >
> > > I built Postgres (latest OSS code) with the latest Parallel Copy patches 
> > > (v4).
> > > The test system was a 32-core Intel Xeon E5-4650 server with 378GB of RAM.
> > >
> > >
> > > I observed the following trends:
> > > - For the data file size used, Parallel Copy achieved best performance
> > > using about 9 – 10 workers. Larger data files may benefit from using
> > > more workers. However, I couldn’t really see any better performance,
> > > for example, from using 16 workers on a 10GB CSV data file compared to
> > > using 8 workers. Results may also vary depending on machine
> > > characteristics.
> > > - Parallel Copy with 1 worker ran slower than normal Copy in a couple
> > > of cases (I did question if allowing 1 worker was useful in my patch
> > > review).
> >
> > I think the reason is that for 1 worker case there is not much
> > parallelization as a leader doesn't perform the actual load work.
> > Vignesh, can you please once see if the results are reproducible at
> > your end, if so, we can once compare the perf profiles to see why in
> > some cases we get improvement and in other cases not. Based on that we
> > can decide whether to allow the 1 worker case or not.
> >
>
> I will spend some time on this and update.
>

Thanks.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-08-27 Thread vignesh C
On Thu, Aug 27, 2020 at 8:24 AM Amit Kapila  wrote:
>
> On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow  wrote:
> >
> > > I have attached new set of patches with the fixes.
> > > Thoughts?
> >
> > Hi Vignesh,
> >
> > I don't really have any further comments on the code, but would like
> > to share some results of some Parallel Copy performance tests I ran
> > (attached).
> >
> > The tests loaded a 5GB CSV data file into a 100 column table (of
> > different data types). The following were varied as part of the test:
> > - Number of workers (1 – 10)
> > - No indexes / 4-indexes
> > - Default settings / increased resources (shared_buffers,work_mem, etc.)
> >
> > (I did not do any partition-related tests as I believe those type of
> > tests were previously performed)
> >
> > I built Postgres (latest OSS code) with the latest Parallel Copy patches 
> > (v4).
> > The test system was a 32-core Intel Xeon E5-4650 server with 378GB of RAM.
> >
> >
> > I observed the following trends:
> > - For the data file size used, Parallel Copy achieved best performance
> > using about 9 – 10 workers. Larger data files may benefit from using
> > more workers. However, I couldn’t really see any better performance,
> > for example, from using 16 workers on a 10GB CSV data file compared to
> > using 8 workers. Results may also vary depending on machine
> > characteristics.
> > - Parallel Copy with 1 worker ran slower than normal Copy in a couple
> > of cases (I did question if allowing 1 worker was useful in my patch
> > review).
>
> I think the reason is that for 1 worker case there is not much
> parallelization as a leader doesn't perform the actual load work.
> Vignesh, can you please once see if the results are reproducible at
> your end, if so, we can once compare the perf profiles to see why in
> some cases we get improvement and in other cases not. Based on that we
> can decide whether to allow the 1 worker case or not.
>

I will spend some time on this and update.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-08-26 Thread Amit Kapila
On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow  wrote:
>
> > I have attached new set of patches with the fixes.
> > Thoughts?
>
> Hi Vignesh,
>
> I don't really have any further comments on the code, but would like
> to share some results of some Parallel Copy performance tests I ran
> (attached).
>
> The tests loaded a 5GB CSV data file into a 100 column table (of
> different data types). The following were varied as part of the test:
> - Number of workers (1 – 10)
> - No indexes / 4-indexes
> - Default settings / increased resources (shared_buffers,work_mem, etc.)
>
> (I did not do any partition-related tests as I believe those type of
> tests were previously performed)
>
> I built Postgres (latest OSS code) with the latest Parallel Copy patches (v4).
> The test system was a 32-core Intel Xeon E5-4650 server with 378GB of RAM.
>
>
> I observed the following trends:
> - For the data file size used, Parallel Copy achieved best performance
> using about 9 – 10 workers. Larger data files may benefit from using
> more workers. However, I couldn’t really see any better performance,
> for example, from using 16 workers on a 10GB CSV data file compared to
> using 8 workers. Results may also vary depending on machine
> characteristics.
> - Parallel Copy with 1 worker ran slower than normal Copy in a couple
> of cases (I did question if allowing 1 worker was useful in my patch
> review).

I think the reason is that for 1 worker case there is not much
parallelization as a leader doesn't perform the actual load work.
Vignesh, can you please once see if the results are reproducible at
your end, if so, we can once compare the perf profiles to see why in
some cases we get improvement and in other cases not. Based on that we
can decide whether to allow the 1 worker case or not.

> - Typical load time improvement (load factor) for Parallel Copy was
> between 2x and 3x. Better load factors can be obtained by using larger
> data files and/or more indexes.
>

Nice improvement and I think you are right that with larger load data
we will get even better improvement.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-08-26 Thread Greg Nancarrow
> I have attached new set of patches with the fixes.
> Thoughts?

Hi Vignesh,

I don't really have any further comments on the code, but would like
to share some results of some Parallel Copy performance tests I ran
(attached).

The tests loaded a 5GB CSV data file into a 100 column table (of
different data types). The following were varied as part of the test:
- Number of workers (1 – 10)
- No indexes / 4-indexes
- Default settings / increased resources (shared_buffers,work_mem, etc.)

(I did not do any partition-related tests as I believe those type of
tests were previously performed)

I built Postgres (latest OSS code) with the latest Parallel Copy patches (v4).
The test system was a 32-core Intel Xeon E5-4650 server with 378GB of RAM.


I observed the following trends:
- For the data file size used, Parallel Copy achieved best performance
using about 9 – 10 workers. Larger data files may benefit from using
more workers. However, I couldn’t really see any better performance,
for example, from using 16 workers on a 10GB CSV data file compared to
using 8 workers. Results may also vary depending on machine
characteristics.
- Parallel Copy with 1 worker ran slower than normal Copy in a couple
of cases (I did question if allowing 1 worker was useful in my patch
review).
- Typical load time improvement (load factor) for Parallel Copy was
between 2x and 3x. Better load factors can be obtained by using larger
data files and/or more indexes.
- Increasing Postgres resources made little or no difference to
Parallel Copy performance when the target table had no indexes.
Increasing Postgres resources improved Parallel Copy performance when
the target table had indexes.

Regards,
Greg Nancarrow
Fujitsu Australia
(1) Postgres default settings, 5GB CSV (510 rows), no indexes on table:

Copy Type       Duration (s)    Load factor
===========================================
Normal Copy     132.838         -

Parallel Copy
(#workers)
 1              97.537          1.36
 2              61.700          2.15
 3              52.788          2.52
 4              46.607          2.85
 5              45.524          2.92
 6              43.799          3.03
 7              42.970          3.09
 8              42.974          3.09
 9              43.698          3.04
10              43.362          3.06


(2) Postgres default settings, 5GB CSV (510 rows), 4 indexes on table:

Copy Type       Duration (s)    Load factor
===========================================
Normal Copy     221.111         -

Parallel Copy
(#workers)
 1              331.609         0.66
 2              99.085          2.23
 3              89.751          2.46
 4              81.137          2.73
 5              79.138          2.79
 6              77.155          2.87
 7              75.813          2.92
 8              74.961          2.95
 9              77.803          2.84
10              75.399          2.93


(3) Postgres increased resources, 5GB CSV (510 rows), no indexes on table:

shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size = 2GB


Copy Type       Duration (s)    Load factor
===========================================
Normal Copy     78.138          -

Parallel Copy
(#workers)
 1              95.203          0.82
 2              62.596          1.24
 3              52.318          1.49
 4              48.246          1.62
 5              42.832          1.82
 6              42.921          1.82
 7              43.146          1.81
 8              41.557          1.88
 9              43.489          1.80
10              43.362          1.80


(4) Postgres increased resources, 5GB CSV (510 rows), 4 indexes on table:

Copy Type       Duration (s)    Load factor

Re: Parallel copy

2020-08-16 Thread Greg Nancarrow
Hi Vignesh,

Some further comments:

(1) v3-0002-Framework-for-leader-worker-in-parallel-copy.patch

+/*
+ * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
+ * block to process to avoid lock contention. This value should be divisible by
+ * RINGSIZE, as wrap around cases is currently not handled while selecting the
+ * WORKER_CHUNK_COUNT by the worker.
+ */
+#define WORKER_CHUNK_COUNT 50


"This value should be divisible by RINGSIZE" is not a correct
statement (since obviously 50 is not divisible by 1).
It should say something like "This value should evenly divide into
RINGSIZE", or "RINGSIZE should be a multiple of WORKER_CHUNK_COUNT".


(2) v3-0003-Allow-copy-from-command-to-process-data-from-file.patch

(i)

+   /*
+    * If the data is present in current block lineInfo. line_size
+    * will be updated. If the data is spread across the blocks either

Somehow a space has been put between "lineinfo." and "line_size".
It should be: "If the data is present in current block
lineInfo.line_size will be updated"

(ii)

>This is not possible because of pg_atomic_compare_exchange_u32, this
>will succeed only for one of the workers whose line_state is
>LINE_LEADER_POPULATED, for other workers it will fail. This is
>explained in detail above ParallelCopyLineBoundary.

Yes, but prior to that call to pg_atomic_compare_exchange_u32(),
aren't you separately reading line_state and line_size, so that
between those reads the line may have transitioned from leader to
another worker? In that case the read line state ("cur_line_state",
being checked in the if block) may not actually match what is now in
line_state, and/or the read line_size ("dataSize") may not correspond
to the read line state.

(sorry, still not 100% convinced that the synchronization and checks
are safe in all cases)
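
To make the concern concrete, the sequence in question looks roughly
like this (a sketch only; the state names follow the patch, and
ProcessClaimedLine() is a hypothetical stand-in for the worker's actual
processing):

    uint32  cur_line_state = pg_atomic_read_u32(&lineInfo->line_state);
    uint32  dataSize = pg_atomic_read_u32(&lineInfo->line_size);

    /* <-- window: the leader may update line_state/line_size here */

    if (cur_line_state == LINE_LEADER_POPULATED && dataSize > 0)
    {
        uint32      expected = LINE_LEADER_POPULATED;

        /*
         * The CAS guarantees that only one worker claims the line, but
         * it does not re-validate that dataSize still matches the
         * state it observed -- which is the gap being questioned.
         */
        if (pg_atomic_compare_exchange_u32(&lineInfo->line_state,
                                           &expected,
                                           LINE_WORKER_PROCESSING))
            ProcessClaimedLine(lineInfo, dataSize);
    }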

(3) v3-0006-Parallel-Copy-For-Binary-Format-Files.patch

>raw_buf is not used in parallel copy, instead raw_buf will be pointing
>to shared memory data blocks. This memory was allocated as part of
>BeginCopyFrom, uptil this point we cannot be 100% sure as copy can be
>performed sequentially like in case max_worker_processes is not
>available, if it switches to sequential mode raw_buf will be used
>while performing copy operation. At this place we can safely free this
>memory that was allocated

So the following code (which checks raw_buf, which still points to
memory that has been pfreed) is still valid?

  In the SetRawBufForLoad() function, which is called by CopyReadLineText():

cur_data_blk_ptr = (cstate->raw_buf) ?
&pcshared_info->data_blocks[cur_block_pos] : NULL;

The above code looks a bit dicey to me. I stepped over that line in
the debugger when I debugged an instance of Parallel Copy, so it
definitely gets executed.
It makes me wonder what other code could possibly be checking raw_buf
and using it in some way, when in fact what it points to has been
pfreed.

Are you able to add the following line of code, or will it (somehow)
break logic that you are relying on?

pfree(cstate->raw_buf);
cstate->raw_buf = NULL;   <=== I suggest that this line is added
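
With that change in place, any remaining (cstate->raw_buf) tests, such
as the one in SetRawBufForLoad() above, would take the NULL branch, so
the parallel case would have to be detected some other way. A sketch of
what that might look like (IsParallelCopy() stands for whatever
predicate the patch already uses to distinguish the two modes):

    /* Decide from the copy mode, not from a pointer whose target may
     * have been pfreed: */
    cur_data_blk_ptr = IsParallelCopy() ?
        &pcshared_info->data_blocks[cur_block_pos] : NULL;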

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Parallel copy

2020-08-11 Thread Greg Nancarrow
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, failed

Hi,

I don't claim to yet understand all of the Postgres internals that this patch 
is updating and interacting with, so I'm still testing and debugging portions 
of this patch, but would like to give feedback on what I've noticed so far.
I have done some ad-hoc testing of the patch using parallel copies from 
text/csv/binary files and have not yet struck any execution problems other than 
some option validation and associated error messages on boundary cases.

One general question that I have: is there a user benefit (over the normal 
non-parallel COPY) to allowing "COPY ... FROM ... WITH (PARALLEL 1)"?


My following comments are broken down by patch:

(1) v2-0001-Copy-code-readjustment-to-support-parallel-copy.patch

(i) Whilst I can't entirely blame these patches for it (as they are following 
what is already there), I can't help noticing the use of numerous macros in 
src/backend/commands/copy.c which paste in multiple lines of code in various 
places.
It's getting a little out of hand. Surely the majority of these would
be better as inline functions? Perhaps this hasn't been done because
too many parameters would need to be passed - thoughts?
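
For example (the macro and fields here are invented for illustration;
this is not one of copy.c's actual macros):

    /* Current style: a multi-line statement-pasting macro. */
    #define INCREMENT_PROCESSED(cstate) \
    do { \
        (cstate)->cur_lineno++; \
        (cstate)->processed++; \
    } while (0)

    /* Inline-function equivalent: type-checked and debuggable, and the
     * compiler will still inline it, so there is no runtime cost. */
    static inline void
    increment_processed(CopyState cstate)
    {
        cstate->cur_lineno++;
        cstate->processed++;
    }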


(2) v2-0002-Framework-for-leader-worker-in-parallel-copy.patch

(i) minor point: there are some tabbing/spacing issues in this patch (and the 
other patches), affecting alignment.
e.g. mixed tabs/spaces and misalignment in PARALLEL_COPY_KEY_xxx definitions

(ii)

+/*
+ * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
+ * block to process to avoid lock contention. This value should be mode of
+ * RINGSIZE, as wrap around cases is currently not handled while selecting the
+ * WORKER_CHUNK_COUNT by the worker.
+ */
+#define WORKER_CHUNK_COUNT 50


"This value should be mode of RINGSIZE ..."

-> typo: mode  (mod?  should evenly divide into RINGSIZE?)


(iii)
+ *using pg_atomic_compare_exchange_u32, worker will change the sate to

->typo: sate  (should be "state")


(iv)

+errmsg("parallel option 
supported only for copy from"),

-> suggest change to:   errmsg("parallel option is supported only for 
COPY FROM"),

(v)

+   errno = 0;  /* To distinguish success/failure after call */
+   val = strtol(str, &endptr, 10);
+
+   /* Check for various possible errors */
+   if ((errno == ERANGE && (val == LONG_MAX || val == LONG_MIN))
+       || (errno != 0 && val == 0) ||
+       *endptr)
+       ereport(ERROR,
+               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                errmsg("improper use of argument to option \"%s\"",
+                       defel->defname),
+                parser_errposition(pstate, defel->location)));
+
+   if (endptr == str)
+       ereport(ERROR,
+               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                errmsg("no digits were found in argument to option \"%s\"",
+                       defel->defname),
+                parser_errposition(pstate, defel->location)));
+
+   cstate->nworkers = (int) val;
+
+   if (cstate->nworkers <= 0)
+       ereport(ERROR,
+               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                errmsg("argument to option \"%s\" must be a positive integer greater than zero",
+                       defel->defname),
+                parser_errposition(pstate, defel->location)));


I think this validation code needs to be improved, including the error
messages (e.g. when can a "positive integer" NOT be greater than zero?).

There is some overlap in the "no digits were found" case between the two
conditions above, depending, for example, on whether the argument is quoted.
Also, "improper use of argument to option" sounds a bit odd and vague to me.
Finally, not range-checking the long value before casting it to int can
allow out-of-range int values, as in the following case:

test=# copy mytable from '/myspace/test_pc
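
Something along the following lines would address all of those points
in one place (a sketch only; MAX_PARALLEL_WORKER_LIMIT is used as an
assumed upper bound -- any existing limit would do):

    char       *endptr;
    long        val;

    errno = 0;
    val = strtol(str, &endptr, 10);

    /* One test catches empty input, trailing junk, overflow, and
     * values outside the range acceptable as an int: */
    if (endptr == str || *endptr != '\0' || errno == ERANGE ||
        val < 1 || val > MAX_PARALLEL_WORKER_LIMIT)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("value for option \"%s\" must be an integer between 1 and %d",
                        defel->defname, MAX_PARALLEL_WORKER_LIMIT),
                 parser_errposition(pstate, defel->location)));

    cstate->nworkers = (int) val;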

Re: Parallel copy

2020-08-05 Thread vignesh C
On Tue, Aug 4, 2020 at 9:51 PM Tomas Vondra
 wrote:
>
> On Mon, Aug 03, 2020 at 12:33:48PM +0530, Bharath Rupireddy wrote:
> >On Sat, Aug 1, 2020 at 9:55 AM vignesh C  wrote:
> >>
> >> The patches were not applying because of the recent commits.
> >> I have rebased the patch over head & attached.
> >>
> >I rebased v2-0006-Parallel-Copy-For-Binary-Format-Files.patch.
> >
> >Putting together all the patches rebased on to the latest commit
> >b8fdee7d0ca8bd2165d46fb1468f75571b706a01. Patches from 0001 to 0005
> >are rebased by Vignesh, that are from the previous mail and the patch
> >0006 is rebased by me.
> >
> >Please consider this patch set for further review.
> >
>
> I'd suggest incrementing the version every time an updated version is
> submitted, even if it's just a rebased version. It makes it clearer
> which version of the code is being discussed, etc.

Sure, we will take care of this when we send the next set of patches.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



