Re: Parallel copy

2022-03-06 Thread Bharath Rupireddy
On Mon, Dec 28, 2020 at 3:14 PM vignesh C  wrote:
>
> Attached is a patch that was used for the same. The patch is written
> on top of the parallel copy patch.
> The design that Amit, Andres and I voted for, i.e. the leader
> identifying the line boundaries and sharing them in shared memory, is
> performing better.

Hi Hackers, I see the following as some of the open problems with the parallel copy feature:

1) Choosing between the two designs: the leader identifies the
line/tuple boundaries in the file and the workers pick them up and
insert in parallel, vs. the leader only reads the file and the workers
identify the line/tuple boundaries and insert
2) Determining parallel safety of partitioned tables
3) Bulk extension of the relation while inserting, i.e. adding more
than one extra block to the relation in RelationAddExtraBlocks

Please let me know if I'm missing anything.

For (1) - Vignesh's experiments above show that the "leader
identifying the line/tuple boundaries from the file, letting the
workers pick and insert in parallel" design fares better.
For (2) - while it's being discussed in another thread (I'm not sure
what the status of that thread is), how about we take this feature in
without support for partitioned tables, i.e. parallel copy is disabled
for partitioned tables? Once the other discussion reaches a logical
end, we can come back and enable parallel copy for partitioned
tables.
For (3) - we need a way to extend relations, i.e. add new blocks,
quickly - fallocate might help here; I'm not sure who's working on it,
others can comment better here.
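
To illustrate what (3) could look like at the file level, here is a minimal
standalone sketch (not PostgreSQL code; the file name and the 64-block count
are invented) of reserving space for many blocks in one posix_fallocate()
call instead of extending one 8kB block at a time:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192                     /* PostgreSQL's default block size */

int
main(void)
{
    /* "relfile" and the 64-block request are arbitrary for this demo */
    int     fd = open("relfile", O_RDWR | O_CREAT, 0600);
    off_t   old_size;
    int     rc;

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    old_size = lseek(fd, 0, SEEK_END);

    /* one call reserves space for 64 new blocks, instead of 64 separate
     * single-block extensions */
    rc = posix_fallocate(fd, old_size, (off_t) 64 * BLCKSZ);
    if (rc != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));

    close(fd);
    return rc == 0 ? 0 : 1;
}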

Can we take the "parallel copy" feature forward, with some
restrictions in place for now?

Thoughts?

Regards,
Bharath Rupireddy.




Re: Parallel copy

2020-12-28 Thread vignesh C
On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila  wrote:
>
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas  wrote:
> >
> > On 02/11/2020 08:14, Amit Kapila wrote:
> > > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas  
> > > wrote:
> > >>
> > >> In this design, you don't need to keep line boundaries in shared memory,
> > >> because each worker process is responsible for finding the line
> > >> boundaries of its own block.
> > >>
> > >> There's a point of serialization here, in that the next block cannot be
> > >> processed, until the worker working on the previous block has finished
> > >> scanning the EOLs, and set the starting position on the next block,
> > >> putting it in READY state. That's not very different from your patch,
> > >> where you had a similar point of serialization because the leader
> > >> scanned the EOLs,
> > >
> > > But in the design (single producer multiple consumer) used by the
> > > patch the worker doesn't need to wait till the complete block is
> > > processed, it can start processing the lines already found. This will
> > > also allow workers to start much earlier to process the data as it
> > > doesn't need to wait for all the offsets corresponding to 64K block
> > > ready. However, in the design where each worker is processing the 64K
> > > block, it can lead to much longer waits. I think this will impact the
> > > Copy STDIN case more where in most cases (200-300 bytes tuples) we
> > > receive line-by-line from client and find the line-endings by leader.
> > > If the leader doesn't find the line-endings the workers need to wait
> > > till the leader fill the entire 64K chunk, OTOH, with current approach
> > > the worker can start as soon as leader is able to populate some
> > > minimum number of line-endings
> >
> > You can use a smaller block size.
> >
>
> Sure, but the same problem can happen if the last line in that block
> is too long and we need to peek into the next block. And then there
> could be cases where a single line could be greater than 64K.
>
> > However, the point of parallel copy is
> > to maximize bandwidth.
> >
>
> Okay, but this first-phase (finding the line boundaries) can anyway be
> not done in parallel and we have seen in some of the initial
> benchmarking that this initial phase is a small part of work
> especially when the table has indexes, constraints, etc. So, I think
> it won't matter much if this splitting is done in a single process or
> multiple processes.
>

I wrote a patch to compare the performance of the current
implementation (the leader identifying the line boundaries) against
the design where the workers identify the line boundaries. The results
are given below; each tuple reads as (time in seconds with the leader
identifying the line boundaries, time in seconds with the workers
identifying the line boundaries, number of workers).

Use case 1 - 10 million rows, 5.2GB data, 3 indexes on integer columns:
(211.206, 632.583, 1), (165.402, 360.152, 2), (137.608, 219.623, 4),
(128.003, 206.851, 8), (114.518, 177.790, 16), (109.257, 170.058, 20),
(102.050, 158.376, 30)

Use case 2 - 10 million rows, 5.2GB data, 2 indexes on integer columns,
1 index on text column, csv file:
(1212.356, 1602.118, 1), (707.191, 849.105, 2), (369.620, 441.068, 4),
(221.359, 252.775, 8), (167.152, 180.207, 16), (168.804, 181.986, 20),
(172.320, 194.875, 30)

Use case 3 - 10 million rows, 5.2GB data without index:
(96.317, 437.453, 1), (70.730, 240.517, 2), (64.436, 197.604, 4),
(67.186, 175.630, 8), (76.561, 156.015, 16), (81.025, 150.687, 20),
(86.578, 148.481, 30)

Use case 4 - 1 records, 9.6GB, toast data:
(147.076, 276.323, 1), (101.610, 141.893, 2), (100.703, 134.096, 4),
(112.583, 134.765, 8), (101.898, 135.789, 16), (109.258, 135.625, 20),
(109.219, 136.144, 30)

Attached is a patch that was used for the same. The patch is written
on top of the parallel copy patch.
The design that Amit, Andres and I voted for, i.e. the leader
identifying the line boundaries and sharing them in shared memory, is
performing better.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From dd9b6be19573b5391d01373b53e64a5c1dc305fd Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Mon, 28 Dec 2020 15:00:48 +0530
Subject: [PATCH v12 7/7] Parallel copy based on workers identifying line
 boundary.

Parallel copy based on workers identifying line boundary.
---
 src/backend/commands/copyfromparse.c |  93 +++
 src/backend/commands/copyparallel.c  | 441 +--
 src/include/commands/copyfrom_internal.h |  38 ++-
 src/test/reg

Re: Parallel copy

2020-12-26 Thread vignesh C
On Wed, Dec 23, 2020 at 3:05 PM Hou, Zhijie  wrote:
>
> Hi
>
> > Yes, this optimization can be done; I will handle this in the next patch
> > set.
> >
>
> I have a suggestion for the parallel safety-check.
>
> As designed, the leader does not participate in the insertion of data.
> If the user specifies (PARALLEL 1), there is only one worker process, which
> will do the insertion.
>
> IMO, we can skip some of the safety checks in this case, because the
> safety checks are there to limit parallel insert
> (except for temporary tables and the like).
>
> So, how about checking (PARALLEL 1) separately?
> Although it makes things a bit more complicated, (PARALLEL 1) does give a
> good performance improvement.
>

Thanks for the comments, Hou Zhijie. I will run a few tests with 1
worker and try to include this in the next patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




RE: Parallel copy

2020-12-23 Thread Hou, Zhijie
Hi

> Yes, this optimization can be done; I will handle this in the next patch
> set.
> 

I have a suggestion for the parallel safety-check.

As designed, the leader does not participate in the insertion of data.
If the user specifies (PARALLEL 1), there is only one worker process, which
will do the insertion.

IMO, we can skip some of the safety checks in this case, because the safety
checks are there to limit parallel insert (except for temporary tables and
the like).

So, how about checking (PARALLEL 1) separately?
Although it makes things a bit more complicated, (PARALLEL 1) does give a
good performance improvement.

Best regards,
houzj




Re: Parallel copy

2020-12-09 Thread vignesh C
On Mon, Dec 7, 2020 at 3:00 PM Hou, Zhijie  wrote:
>
> > Attached v11 patch has the fix for this, it also includes the changes to
> > rebase on top of head.
>
> Thanks for the explanation.
>
> I think there are still cases where we can know the size.
>
> +    * line_size will be set. Read the line_size again to be sure if it is
> +    * completed or partial block.
> +    */
> +   dataSize = pg_atomic_read_u32(&lineInfo->line_size);
> +   if (dataSize != -1)
> +   {
>
> If I am not wrong, this seems to be the branch that processes the populated block.
> I think we can check copiedSize here: if copiedSize == 0, that means
> dataSize is the size of the whole line, and in this case we can do the
> enlarge.
>
>

Yes, this optimization can be done; I will handle this in the next patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




RE: Parallel copy

2020-12-07 Thread Hou, Zhijie
> > 4.
> > A suggestion for CacheLineInfo.
> >
> > It uses appendBinaryStringXXX to store the line in memory.
> > appendBinaryStringXXX doubles the str memory when there is not enough space.
> >
> > How about calling enlargeStringInfo in advance, if we already know the whole
> > line size?
> > It can avoid some memory waste and may improve performance a little.
> >
> 
> Here we will not know the size beforehand; in some cases we will start
> processing the data when the current block is populated and keep processing
> block by block, and we will come to know the size only at the end. We cannot
> use enlargeStringInfo because of this.
> 
> Attached v11 patch has the fix for this, it also includes the changes to
> rebase on top of head.

Thanks for the explanation.

I think there are still cases where we can know the size.

+    * line_size will be set. Read the line_size again to be sure if it is
+    * completed or partial block.
+    */
+   dataSize = pg_atomic_read_u32(&lineInfo->line_size);
+   if (dataSize != -1)
+   {

If I am not wrong, this seems to be the branch that processes the populated block.
I think we can check copiedSize here: if copiedSize == 0, that means
dataSize is the size of the whole line, and in this case we can do the enlarge.


Best regards,
houzj






RE: Parallel copy

2020-11-19 Thread Hou, Zhijie
Hi Vignesh,

I took a look at the v10 patch set. Here are some comments:

1. 
+/*
+ * CheckExprParallelSafety
+ *
+ * Determine if where cluase and default expressions are parallel safe & do not
+ * have volatile expressions, return true if condition satisfies else return
+ * false.
+ */

'cluase' seems a typo.


2.
+   /*
+    * Make sure that no worker has consumed this element, if this
+    * line is spread across multiple data blocks, worker would have
+    * started processing, no need to change the state to
+    * LINE_LEADER_POPULATING in this case.
+    */
+   (void) pg_atomic_compare_exchange_u32(&lineInfo->line_state,
+                                         &curr_line_state,
+                                         LINE_LEADER_POPULATED);
About the comment:

+* started processing, no need to change the state to
+* LINE_LEADER_POPULATING in this case.

Does it mean 'no need to change the state to LINE_LEADER_POPULATED' here?


3.
+ * 3) only one worker should choose one line for processing, this is handled by
+ *using pg_atomic_compare_exchange_u32, worker will change the state to
+ *LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.

In the latest patch, it will set the state to LINE_WORKER_PROCESSING if
line_state is LINE_LEADER_POPULATED or LINE_LEADER_POPULATING.
So the comment here seems wrong.


4.
A suggestion for CacheLineInfo.

It uses appendBinaryStringXXX to store the line in memory.
appendBinaryStringXXX doubles the str memory when there is not enough space.

How about calling enlargeStringInfo in advance, if we already know the whole
line size?
It can avoid some memory waste and may improve performance a little.
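
To illustrate the suggestion, a minimal standalone sketch in plain C (not
the StringInfo code itself; Buf and the function names are invented)
contrasting grow-by-doubling appends with reserving the known size once up
front:

#include <stdlib.h>
#include <string.h>

typedef struct { char *data; size_t len; size_t maxlen; } Buf;

/* Grow-by-doubling append, similar in spirit to appendBinaryStringInfo:
 * a long line arriving in many small pieces triggers repeated reallocs. */
static void
buf_append(Buf *b, const char *src, size_t n)
{
    while (b->len + n > b->maxlen)
    {
        b->maxlen = b->maxlen ? b->maxlen * 2 : 64;
        b->data = realloc(b->data, b->maxlen);
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
}

/* Analogue of calling enlargeStringInfo in advance: one allocation and
 * no doubling slack, usable when the whole line size is known. */
static void
buf_reserve(Buf *b, size_t total)
{
    if (total > b->maxlen)
    {
        b->maxlen = total;
        b->data = realloc(b->data, b->maxlen);
    }
}

int
main(void)
{
    Buf b = {0};

    buf_reserve(&b, 1000);                    /* known line size: reserve once */
    for (int i = 0; i < 100; i++)
        buf_append(&b, "0123456789", 10);     /* no reallocations happen now */
    free(b.data);
    return 0;
}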


Best regards,
houzj






Re: Parallel copy

2020-11-18 Thread vignesh C
On Sat, Nov 7, 2020 at 7:01 PM vignesh C  wrote:
>
> On Thu, Nov 5, 2020 at 6:33 PM Hou, Zhijie  wrote:
> >
> > Hi
> >
> > >
> > > my $bytes = $ARGV[0];
> > > for(my $i = 0; $i < $bytes; $i+=8){
> > >  print "longdata";
> > > }
> > > print "\n";
> > >
> > > postgres=# copy longdata from program 'perl /tmp/longdata.pl 1'
> > > with (parallel 2);
> > >
> > > This gets stuck forever (or at least I didn't have the patience to wait
> > > for it to finish). Both worker processes are consuming 100% of CPU.
> >
> > I had a look over this problem.
> >
> > the ParallelCopyDataBlock has size limit:
> > uint8   skip_bytes;
> > char    data[DATA_BLOCK_SIZE];  /* data read from file */
> >
> > It seems the input line is so long that the leader process ran out of the
> > shared memory among the parallel copy workers,
> > and the leader process keeps waiting for a free block.
> >
> > For the worker process, it waits until line_state becomes LINE_LEADER_POPULATED,
> > but the leader process won't set the line_state unless it has read the whole line.
> >
> > So it is stuck forever.
> > Maybe we should reconsider this situation.
> >
> > The stack is as follows:
> >
> > Leader stack:
> > #3  0x0075f7a1 in WaitLatch (latch=, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, wait_event_info=wait_event_info@entry=150994945) at latch.c:411
> > #4  0x005a9245 in WaitGetFreeCopyBlock (pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546
> > #5  0x005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, raw_buf_ptr=raw_buf_ptr@entry=65536, copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572
> > #6  0x005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at copy.c:4058
> > #7  0x005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at copy.c:3863
> >
> > Worker stack:
> > #0  GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474
> > #1  0x005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, buff_count=buff_count@entry=0) at copyparallel.c:711
> > #2  0x005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at copyparallel.c:885
> > #3  0x005a4f2e in NextCopyFromRawFields (cstate=cstate@entry=0x29e1f28, fields=fields@entry=0x7fff4cdc0b48, nfields=nfields@entry=0x7fff4cdc0b44) at copy.c:3615
> > #4  0x005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at copy.c:3696
> > #5  0x005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at copy.c:2985
> >
>
> Thanks for providing your thoughts. I have analyzed this issue and I'm
> working on the fix; I will be posting a patch for it shortly.
>

I have fixed and provided a patch for this at [1]
[1]
https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com


Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-11-18 Thread vignesh C
On Sat, Oct 31, 2020 at 2:07 AM Tomas Vondra  wrote:
>
> Hi,
>
> I've done a bit more testing today, and I think the parsing is busted in
> some way. Consider this:
>
>  test=# create extension random;
>  CREATE EXTENSION
>
>  test=# create table t (a text);
>  CREATE TABLE
>
>  test=# insert into t select random_string(random_int(10, 256*1024)) from generate_series(1,1);
>  INSERT 0 1
>
>  test=# copy t to '/mnt/data/t.csv';
>  COPY 1
>
>  test=# truncate t;
>  TRUNCATE TABLE
>
>  test=# copy t from '/mnt/data/t.csv';
>  COPY 1
>
>  test=# truncate t;
>  TRUNCATE TABLE
>
>  test=# copy t from '/mnt/data/t.csv' with (parallel 2);
>  ERROR:  invalid byte sequence for encoding "UTF8": 0x00
>  CONTEXT:  COPY t, line 485: "m&\nh%_a"%r]>qtCl:Q5ltvF~;2oS6@HB
>F>og,bD$Lw'nZY\tYl#BH\t{(j~ryoZ08"SGU~.}8CcTRk1\ts$@U3szCC+U1U3i@P..."
>  parallel worker
>
>
> The functions come from an extension I use to generate random data, I've
> pushed it to github [1]. The random_string() generates a random string
> with ASCII characters, symbols and a couple special characters (\r\n\t).
> The intent was to try loading data where a field may span multiple 64kB
> blocks and may contain newlines etc.
>
> The non-parallel copy works fine, the parallel one fails. I haven't
> investigated the details, but I guess it gets confused about where a
> string starts/end, or something like that.
>

Thanks for identifying this issue; it is fixed in the v10 patch posted
at [1]
[1]
https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com


Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-11-18 Thread vignesh C
On Fri, Nov 13, 2020 at 2:25 PM Amit Kapila  wrote:
>
> On Wed, Nov 11, 2020 at 10:42 PM vignesh C  wrote:
> >
> > On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila  wrote:
> > >
> > > On Tue, Nov 10, 2020 at 7:12 PM vignesh C  wrote:
> > > >
> > > > On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > >
> > > > I have worked to provide a patch for the parallel safety checks. It
> > > > checks whether parallel copy can be performed; parallel copy cannot be
> > > > performed in the following cases: a) the relation is a temporary table,
> > > > b) the relation is a foreign table, c) the relation has non-parallel-safe
> > > > index expressions, d) the relation has triggers of a type other than
> > > > before-statement triggers, e) the relation has check constraints which
> > > > are not parallel safe, f) the relation is partitioned and any partition
> > > > falls under the above cases. This patch has the checks for it and will
> > > > be used by the parallel copy implementation.
> > > >
> > >
> > > How did you ensure that this is sufficient? For parallel-insert's
> > > patch we have enabled parallel-mode for Inserts and ran the tests with
> > > force_parallel_mode to see if we are not missing anything. Also, it
> > > seems there are many common things here w.r.t parallel-insert patch,
> > > is it possible to prepare this atop that patch or do you have any
> > > reason to keep this separate?
> > >
> >
> > I have done similar testing for copy too: I had set force_parallel_mode
> > to regress, hardcoded the code to pick parallel workers for the copy
> > operation, and ran make installcheck-world to verify. Many checks
> > in this patch are common between both patches, but I was not sure how
> > to handle it as both the projects are in-progress and are being
> > updated based on the reviewer's opinion. How to handle this?
> > Thoughts?
> >
>
> I have not studied the differences in detail but if it is possible to
> prepare it on top of that patch then there shouldn't be a problem. To
> avoid confusion if you want you can always either post the latest
> version of that patch with your patch or point to it.
>

I have kept this as a separate patch for now. I will work on seeing
whether I can use Greg's changes as they are; if required, I will provide
a few review comments on top of Greg's patch so that it is usable for
parallel copy too, and later post a separate patch with the changes on
top of it. I will retain it as a separate patch until then.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-11-18 Thread vignesh C
On Thu, Oct 29, 2020 at 2:26 PM Daniel Westermann (DWE)  wrote:
>
> On 27/10/2020 15:36, vignesh C wrote:
> >> Attached v9 patches have the fixes for the above comments.
>
> >I did some testing:
>
> I did some testing as well and have a cosmetic remark:
>
> postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 10);
> ERROR:  value 10 out of bounds for option "parallel"
> DETAIL:  Valid values are between "1" and "1024".
> postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 1000);
> ERROR:  parallel requires an integer value
> postgres=#
>
> Wouldn't it make more sense to only have one error message? The first one 
> seems to be the better message.
>

I had seen similar behavior in other places too:
postgres=# vacuum (parallel 10) t1;
ERROR:  parallel vacuum degree must be between 0 and 1024
LINE 1: vacuum (parallel 10) t1;
^
postgres=# vacuum (parallel 1000) t1;
ERROR:  parallel requires an integer value

I'm not sure if we should fix this.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-11-18 Thread vignesh C
On Wed, Oct 28, 2020 at 5:36 PM Hou, Zhijie  wrote:
>
> Hi
>
> I found some issue in v9-0002
>
> 1.
> +
> +   elog(DEBUG1, "[Worker] Processing - line position:%d, block:%d, unprocessed lines:%d, offset:%d, line size:%d",
> +        write_pos, lineInfo->first_block,
> +        pg_atomic_read_u32(&data_blk_ptr->unprocessed_line_parts),
> +        offset, pg_atomic_read_u32(&lineInfo->line_size));
> +
>
> write_pos and the other variables printed here are of type uint32; I think
> it'd be better to use '%u' in the elog message.
>

Modified it.

> 2.
> +    * line_size will be set. Read the line_size again to be sure if it is
> +    * completed or partial block.
> +    */
> +   dataSize = pg_atomic_read_u32(&lineInfo->line_size);
> +   if (dataSize)
>
> It uses dataSize (type int) to receive a uint32, which seems a little dangerous.
> Would it be better to define dataSize as uint32 here?
>

Modified it.

> 3.
> Since functions with 'Cstate' in the name have been changed to 'CState',
> I think we can change the function PopulateCommonCstateInfo as well.
>

Modified it.

> 4.
> +   if (pcdata->worker_line_buf_count)
>
> I think checks like the above can be 'if (xxx > 0)', which seems
> easier to understand.

Modified it.

Thanks for the comments; these issues are fixed in the v10 patch posted at [1]
[1]
https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-11-18 Thread vignesh C
On Thu, Oct 29, 2020 at 2:20 PM Heikki Linnakangas  wrote:
>
> On 27/10/2020 15:36, vignesh C wrote:
> > Attached v9 patches have the fixes for the above comments.
>
> I did some testing:
>
> /tmp/longdata.pl:
> 
> #!/usr/bin/perl
> #
> # Generate three rows:
> # foo
> # longdatalongdatalongdata...
> # bar
> #
> # The length of the middle row is given as command line arg.
> #
>
> my $bytes = $ARGV[0];
>
> print "foo\n";
> for(my $i = 0; $i < $bytes; $i+=8){
>  print "longdata";
> }
> print "\n";
> print "bar\n";
> 
>
> postgres=# copy longdata from program 'perl /tmp/longdata.pl 1'
> with (parallel 2);
>
> This gets stuck forever (or at least I didn't have the patience to wait
> for it to finish). Both worker processes are consuming 100% of CPU.
>

Thanks for identifying this issue; it is fixed in the v10 patch posted
at [1]
[1]
https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-11-17 Thread Bharath Rupireddy
On Thu, Oct 29, 2020 at 2:54 PM Amit Kapila  wrote:
>
> 4) Worker has to hop through all the processed chunks before getting
> the chunk which it can process.
>
> One more point, I have noticed that some time back [1], I have given
> one suggestion related to the way workers process the set of lines
> (aka chunk). I think you can try by increasing the chunk size to say
> 100, 500, 1000 and use some shared counter to remember the number of
> chunks processed.
>

Hi, I did some analysis comparing a spinlock-protected worker write position
(i.e. each worker acquires a spinlock on a shared write position to choose
the next available chunk) vs. each worker hopping to find the next available
chunk position:

Use Case: 10 million rows, 5.6GB data, 2 indexes on integer columns, 1 index
on text column; results are of the form (no of workers, total exec time in
sec, index insertion time in sec, worker write pos get time in sec, buffer
contention event count):

With spinlock:
(1,1126.443,1060.067,0.478,*0*), (2,669.343,630.769,0.306,*26*),
(4,346.297,326.950,0.161,*89*), (8,209.600,196.417,0.088,*291*),
(16,166.113,157.086,0.065,*1468*), (20,173.884,166.013,0.067,*2700*),
(30,173.087,1166.565,0.0065,*5346*)
Without spinlock:
(1,1119.695,1054.586,0.496,*0*), (2,645.733,608.313,1.5,*8*),
(4,340.620,320.344,1.6,*58*), (8,203.985,189.644,1.3,*222*),
(16,142.997,133.045,1,*813*), (20,132.621,122.527,1.1,*1215*),
(30,135.737,126.716,1.5,*2901*)

With the spinlock, each worker gets the required write position quickly and
proceeds until the index insertion (which becomes a single point of
contention), where we observed more buffer lock contention. The reason is
that all the workers reach the index insertion point at around the same
time.

Without the spinlock, each worker spends some time hopping to get the write
position, during which the other workers are inserting into the indexes. So
the workers do not all reach the index insertion point at the same time, and
hence there is less buffer lock contention.

The same behaviour (explained above) is observed with different worker chunk
counts (default 64, 128, 512 and 1024), i.e. the number of tuples each worker
caches into its local memory before inserting into the table.

In summary: with the spinlock, it looks like we are able to avoid workers
waiting to get the next chunk, which also means that we are not creating any
contention point inside the parallel copy code. However, this causes another
choking point, i.e. index insertion if indexes are available on the table,
which is out of scope of the parallel copy code. We think that it would be
good to use a spinlock-protected worker write position, or an atomic variable
for the worker write position (as it performs equal to the spinlock, or a
little better on some platforms). Thoughts?
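
To make the two schemes concrete, here is a minimal standalone sketch (C
with pthreads and C11 atomics; the chunk counts and names are invented, and
this is not the patch code). claim_fast() models the shared write position
via an atomic fetch-add (a spinlock around a plain counter behaves
equivalently), while claim_hop() models hopping over already-claimed chunks:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NCHUNKS  1024
#define NWORKERS 4

static atomic_uint next_chunk;          /* shared worker write position */
static atomic_int  claimed[NCHUNKS];    /* per-chunk "taken" flags */

/* Scheme 1: one atomic fetch-add hands out the next chunk directly */
static int
claim_fast(void)
{
    unsigned int c = atomic_fetch_add(&next_chunk, 1);

    return c < NCHUNKS ? (int) c : -1;
}

/* Scheme 2: hop forward until an unclaimed chunk is found */
static int
claim_hop(void)
{
    for (int c = 0; c < NCHUNKS; c++)
    {
        int     expected = 0;

        if (atomic_compare_exchange_strong(&claimed[c], &expected, 1))
            return c;
    }
    return -1;
}

static void *
worker(void *arg)
{
    int     n = 0;

    (void) arg;
    /* swap claim_fast() for claim_hop() to compare the two schemes */
    while (claim_fast() >= 0)
        n++;
    printf("worker claimed %d chunks\n", n);
    return NULL;
}

int
main(void)
{
    pthread_t   t[NWORKERS];

    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);
    return 0;
}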

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-11-13 Thread Amit Kapila
On Wed, Nov 11, 2020 at 10:42 PM vignesh C  wrote:
>
> On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila  wrote:
> >
> > On Tue, Nov 10, 2020 at 7:12 PM vignesh C  wrote:
> > >
> > > On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila  
> > > wrote:
> > > >
> > >
> > > I have worked to provide a patch for the parallel safety checks. It
> > > checks whether parallel copy can be performed; parallel copy cannot be
> > > performed in the following cases: a) the relation is a temporary table,
> > > b) the relation is a foreign table, c) the relation has non-parallel-safe
> > > index expressions, d) the relation has triggers of a type other than
> > > before-statement triggers, e) the relation has check constraints which
> > > are not parallel safe, f) the relation is partitioned and any partition
> > > falls under the above cases. This patch has the checks for it and will
> > > be used by the parallel copy implementation.
> > >
> >
> > How did you ensure that this is sufficient? For parallel-insert's
> > patch we have enabled parallel-mode for Inserts and ran the tests with
> > force_parallel_mode to see if we are not missing anything. Also, it
> > seems there are many common things here w.r.t parallel-insert patch,
> > is it possible to prepare this atop that patch or do you have any
> > reason to keep this separate?
> >
>
> I have done similar testing for copy too: I had set force_parallel_mode
> to regress, hardcoded the code to pick parallel workers for the copy
> operation, and ran make installcheck-world to verify. Many checks
> in this patch are common between both patches, but I was not sure how
> to handle it as both the projects are in-progress and are being
> updated based on the reviewer's opinion. How to handle this?
> Thoughts?
>

I have not studied the differences in detail but if it is possible to
prepare it on top of that patch then there shouldn't be a problem. To
avoid confusion if you want you can always either post the latest
version of that patch with your patch or point to it.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-11-11 Thread vignesh C
On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila  wrote:
>
> On Tue, Nov 10, 2020 at 7:12 PM vignesh C  wrote:
> >
> > On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila  wrote:
> > >
> >
> > I have worked to provide a patch for the parallel safety checks. It
> > checks whether parallel copy can be performed; parallel copy cannot be
> > performed in the following cases: a) the relation is a temporary table,
> > b) the relation is a foreign table, c) the relation has non-parallel-safe
> > index expressions, d) the relation has triggers of a type other than
> > before-statement triggers, e) the relation has check constraints which
> > are not parallel safe, f) the relation is partitioned and any partition
> > falls under the above cases. This patch has the checks for it and will
> > be used by the parallel copy implementation.
> >
>
> How did you ensure that this is sufficient? For parallel-insert's
> patch we have enabled parallel-mode for Inserts and ran the tests with
> force_parallel_mode to see if we are not missing anything. Also, it
> seems there are many common things here w.r.t parallel-insert patch,
> is it possible to prepare this atop that patch or do you have any
> reason to keep this separate?
>

I have done similar testing for copy too: I had set force_parallel_mode
to regress, hardcoded the code to pick parallel workers for the copy
operation, and ran make installcheck-world to verify. Many checks
in this patch are common between both patches, but I was not sure how
to handle it as both the projects are in-progress and are being
updated based on the reviewer's opinion. How to handle this?
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-11-10 Thread Amit Kapila
On Tue, Nov 10, 2020 at 7:12 PM vignesh C  wrote:
>
> On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila  wrote:
> >
>
> I have worked to provide a patch for the parallel safety checks. It
> checks whether parallel copy can be performed; parallel copy cannot be
> performed in the following cases: a) the relation is a temporary table,
> b) the relation is a foreign table, c) the relation has non-parallel-safe
> index expressions, d) the relation has triggers of a type other than
> before-statement triggers, e) the relation has check constraints which
> are not parallel safe, f) the relation is partitioned and any partition
> falls under the above cases. This patch has the checks for it and will
> be used by the parallel copy implementation.
>

How did you ensure that this is sufficient? For parallel-insert's
patch we have enabled parallel-mode for Inserts and ran the tests with
force_parallel_mode to see if we are not missing anything. Also, it
seems there are many common things here w.r.t parallel-insert patch,
is it possible to prepare this atop that patch or do you have any
reason to keep this separate?

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-11-10 Thread vignesh C
On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila  wrote:
>
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas  wrote:
> >
> > On 02/11/2020 08:14, Amit Kapila wrote:
> > > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas  
> > > wrote:
> > >>
> > >> In this design, you don't need to keep line boundaries in shared memory,
> > >> because each worker process is responsible for finding the line
> > >> boundaries of its own block.
> > >>
> > >> There's a point of serialization here, in that the next block cannot be
> > >> processed, until the worker working on the previous block has finished
> > >> scanning the EOLs, and set the starting position on the next block,
> > >> putting it in READY state. That's not very different from your patch,
> > >> where you had a similar point of serialization because the leader
> > >> scanned the EOLs,
> > >
> > > But in the design (single producer multiple consumer) used by the
> > > patch the worker doesn't need to wait till the complete block is
> > > processed, it can start processing the lines already found. This will
> > > also allow workers to start much earlier to process the data as it
> > > doesn't need to wait for all the offsets corresponding to 64K block
> > > ready. However, in the design where each worker is processing the 64K
> > > block, it can lead to much longer waits. I think this will impact the
> > > Copy STDIN case more where in most cases (200-300 bytes tuples) we
> > > receive line-by-line from client and find the line-endings by leader.
> > > If the leader doesn't find the line-endings the workers need to wait
> > > till the leader fill the entire 64K chunk, OTOH, with current approach
> > > the worker can start as soon as leader is able to populate some
> > > minimum number of line-endings
> >
> > You can use a smaller block size.
> >
>
> Sure, but the same problem can happen if the last line in that block
> is too long and we need to peek into the next block. And then there
> could be cases where a single line could be greater than 64K.
>
> > However, the point of parallel copy is
> > to maximize bandwidth.
> >
>
> Okay, but this first-phase (finding the line boundaries) can anyway be
> not done in parallel and we have seen in some of the initial
> benchmarking that this initial phase is a small part of work
> especially when the table has indexes, constraints, etc. So, I think
> it won't matter much if this splitting is done in a single process or
> multiple processes.
>
> > If the workers ever have to sit idle, it means
> > that the bottleneck is in receiving data from the client, i.e. the
> > backend is fast enough, and you can't make the overall COPY finish any
> > faster no matter how you do it.
> >
> > > The other point is that the leader backend won't be used completely as
> > > it is only doing a very small part (primarily reading the file) of the
> > > overall work.
> >
> > An idle process doesn't cost anything. If you have free CPU resources,
> > use more workers.
> >
> > > We have discussed both these approaches (a) single producer multiple
> > > consumer, and (b) all workers doing the processing as you are saying
> > > in the beginning and concluded that (a) is better, see some of the
> > > relevant emails [1][2][3].
> > >
> > > [1] - 
> > > https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
> > > [2] - 
> > > https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
> > > [3] - 
> > > https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de
> >
> > Sorry I'm late to the party. I don't think the design I proposed was
> > discussed in that threads.
> >
>
> I think something close to that is discussed as you have noticed in
> your next email but IIRC, because many people (Andres, Ants, myself
> and author) favoured the current approach (single reader and multiple
> consumers) we decided to go with that. I feel this patch is very much
> in the POC stage due to which the code doesn't look good and as we
> move forward we need to see what is the better way to improve it,
> maybe one of the ways is to split it as you are suggesting so that it
> can be easier to review. I think the other important thing which this
> patch has not addressed properly is the parallel-safety checks as
> pointed by me earlier. There are two things to solve there (a) the

Re: Parallel copy

2020-11-07 Thread vignesh C
On Thu, Nov 5, 2020 at 6:33 PM Hou, Zhijie  wrote:
>
> Hi
>
> >
> > my $bytes = $ARGV[0];
> > for(my $i = 0; $i < $bytes; $i+=8){
> >  print "longdata";
> > }
> > print "\n";
> > 
> >
> > postgres=# copy longdata from program 'perl /tmp/longdata.pl 1'
> > with (parallel 2);
> >
> > This gets stuck forever (or at least I didn't have the patience to wait
> > for it to finish). Both worker processes are consuming 100% of CPU.
>
> I had a look over this problem.
>
> the ParallelCopyDataBlock has size limit:
> uint8   skip_bytes;
> char    data[DATA_BLOCK_SIZE];  /* data read from file */
>
> It seems the input line is so long that the leader process ran out of the
> shared memory among the parallel copy workers,
> and the leader process keeps waiting for a free block.
>
> For the worker process, it waits until line_state becomes LINE_LEADER_POPULATED,
> but the leader process won't set the line_state unless it has read the whole line.
>
> So it is stuck forever.
> Maybe we should reconsider this situation.
>
> The stack is as follows:
>
> Leader stack:
> #3  0x0075f7a1 in WaitLatch (latch=, 
> wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, 
> wait_event_info=wait_event_info@entry=150994945) at latch.c:411
> #4  0x005a9245 in WaitGetFreeCopyBlock 
> (pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546
> #5  0x005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, 
> line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, 
> raw_buf_ptr=raw_buf_ptr@entry=65536,
> copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572
> #6  0x005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at 
> copy.c:4058
> #7  0x005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at 
> copy.c:3863
>
> Worker stack:
> #0  GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474
> #1  0x005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, 
> buff_count=buff_count@entry=0) at copyparallel.c:711
> #2  0x005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at 
> copyparallel.c:885
> #3  0x005a4f2e in NextCopyFromRawFields 
> (cstate=cstate@entry=0x29e1f28, fields=fields@entry=0x7fff4cdc0b48, 
> nfields=nfields@entry=0x7fff4cdc0b44) at copy.c:3615
> #4  0x005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, 
> econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at 
> copy.c:3696
> #5  0x005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at 
> copy.c:2985
>

Thanks for providing your thoughts. I have analyzed this issue and I'm
working on the fix; I will be posting a patch for it shortly.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




RE: Parallel copy

2020-11-05 Thread Hou, Zhijie
Hi

> 
> my $bytes = $ARGV[0];
> for(my $i = 0; $i < $bytes; $i+=8){
>  print "longdata";
> }
> print "\n";
> 
> 
> postgres=# copy longdata from program 'perl /tmp/longdata.pl 1'
> with (parallel 2);
> 
> This gets stuck forever (or at least I didn't have the patience to wait
> for it to finish). Both worker processes are consuming 100% of CPU.

I had a look over this problem.

the ParallelCopyDataBlock has size limit:
uint8   skip_bytes;
char    data[DATA_BLOCK_SIZE];  /* data read from file */

It seems the input line is so long that the leader process ran out of the
shared memory among the parallel copy workers,
and the leader process keeps waiting for a free block.

For the worker process, it waits until line_state becomes LINE_LEADER_POPULATED,
but the leader process won't set the line_state unless it has read the whole line.

So it is stuck forever.
Maybe we should reconsider this situation.

The stack is as follows:

Leader stack:
#3  0x0075f7a1 in WaitLatch (latch=, 
wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, 
wait_event_info=wait_event_info@entry=150994945) at latch.c:411
#4  0x005a9245 in WaitGetFreeCopyBlock 
(pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546
#5  0x005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, 
line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, 
raw_buf_ptr=raw_buf_ptr@entry=65536, 
copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572
#6  0x005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at 
copy.c:4058
#7  0x005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at 
copy.c:3863

Worker stack:
#0  GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474
#1  0x005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, 
buff_count=buff_count@entry=0) at copyparallel.c:711
#2  0x005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at 
copyparallel.c:885
#3  0x005a4f2e in NextCopyFromRawFields (cstate=cstate@entry=0x29e1f28, 
fields=fields@entry=0x7fff4cdc0b48, nfields=nfields@entry=0x7fff4cdc0b44) at 
copy.c:3615
#4  0x005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, 
econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at 
copy.c:3696
#5  0x005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at 
copy.c:2985


Best regards,
houzj





Re: Parallel copy

2020-11-03 Thread Heikki Linnakangas

On 03/11/2020 10:59, Amit Kapila wrote:

On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas  wrote:

However, the point of parallel copy is to maximize bandwidth.


Okay, but this first-phase (finding the line boundaries) can anyway
be not done in parallel and we have seen in some of the initial 
benchmarking that this initial phase is a small part of work 
especially when the table has indexes, constraints, etc. So, I think 
it won't matter much if this splitting is done in a single process
or multiple processes.
Right, it won't matter performance-wise. That's not my point. The 
difference is in the complexity. If you don't store the line boundaries 
in shared memory, you get away with much simpler shared memory structures.



I think something close to that is discussed as you have noticed in
your next email but IIRC, because many people (Andres, Ants, myself
and author) favoured the current approach (single reader and multiple
consumers) we decided to go with that. I feel this patch is very much
in the POC stage due to which the code doesn't look good and as we
move forward we need to see what is the better way to improve it,
maybe one of the ways is to split it as you are suggesting so that it
can be easier to review.


Sure. I think the roadmap here is:

1. Split copy.c [1]. Not strictly necessary, but I think it'd make this
nicer to review and work with.


2. Refactor CopyReadLine(), so that finding the line-endings and the 
rest of the line-parsing are separated into separate functions.


3. Implement parallel copy.


I think the other important thing which this
patch has not addressed properly is the parallel-safety checks as
pointed by me earlier. There are two things to solve there (a) the
lower-level code (like heap_* APIs, CommandCounterIncrement, xact.c
APIs, etc.) has checks which don't allow any writes, we need to see
which of those we can open now (or do some additional work to prevent
from those checks) after some of the work done for parallel-writes in
PG-13[1][2], and (b) in which all cases we can parallel-writes
(parallel copy) is allowed, for example need to identify whether table
or one of its partitions has any constraint/expression which is
parallel-unsafe.


Agreed, that needs to be solved. I haven't given it any thought myself.

- Heikki

[1] 
https://www.postgresql.org/message-id/8e15b560-f387-7acc-ac90-763986617bfb%40iki.fi





Re: Parallel copy

2020-11-03 Thread Amit Kapila
On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas  wrote:
>
> On 02/11/2020 08:14, Amit Kapila wrote:
> > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas  wrote:
> >>
> >> In this design, you don't need to keep line boundaries in shared memory,
> >> because each worker process is responsible for finding the line
> >> boundaries of its own block.
> >>
> >> There's a point of serialization here, in that the next block cannot be
> >> processed, until the worker working on the previous block has finished
> >> scanning the EOLs, and set the starting position on the next block,
> >> putting it in READY state. That's not very different from your patch,
> >> where you had a similar point of serialization because the leader
> >> scanned the EOLs,
> >
> > But in the design (single producer multiple consumer) used by the
> > patch the worker doesn't need to wait till the complete block is
> > processed, it can start processing the lines already found. This will
> > also allow workers to start much earlier to process the data as it
> > doesn't need to wait for all the offsets corresponding to 64K block
> > ready. However, in the design where each worker is processing the 64K
> > block, it can lead to much longer waits. I think this will impact the
> > Copy STDIN case more where in most cases (200-300 bytes tuples) we
> > receive line-by-line from client and find the line-endings by leader.
> > If the leader doesn't find the line-endings the workers need to wait
> > till the leader fill the entire 64K chunk, OTOH, with current approach
> > the worker can start as soon as leader is able to populate some
> > minimum number of line-endings
>
> You can use a smaller block size.
>

Sure, but the same problem can happen if the last line in that block
is too long and we need to peek into the next block. And then there
could be cases where a single line could be greater than 64K.

> However, the point of parallel copy is
> to maximize bandwidth.
>

Okay, but this first-phase (finding the line boundaries) can anyway be
not done in parallel and we have seen in some of the initial
benchmarking that this initial phase is a small part of work
especially when the table has indexes, constraints, etc. So, I think
it won't matter much if this splitting is done in a single process or
multiple processes.

> If the workers ever have to sit idle, it means
> that the bottleneck is in receiving data from the client, i.e. the
> backend is fast enough, and you can't make the overall COPY finish any
> faster no matter how you do it.
>
> > The other point is that the leader backend won't be used completely as
> > it is only doing a very small part (primarily reading the file) of the
> > overall work.
>
> An idle process doesn't cost anything. If you have free CPU resources,
> use more workers.
>
> > We have discussed both these approaches (a) single producer multiple
> > consumer, and (b) all workers doing the processing as you are saying
> > in the beginning and concluded that (a) is better, see some of the
> > relevant emails [1][2][3].
> >
> > [1] - 
> > https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
> > [2] - 
> > https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
> > [3] - 
> > https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de
>
> Sorry I'm late to the party. I don't think the design I proposed was
> discussed in that threads.
>

I think something close to that is discussed as you have noticed in
your next email but IIRC, because many people (Andres, Ants, myself
and author) favoured the current approach (single reader and multiple
consumers) we decided to go with that. I feel this patch is very much
in the POC stage due to which the code doesn't look good and as we
move forward we need to see what is the better way to improve it,
maybe one of the ways is to split it as you are suggesting so that it
can be easier to review. I think the other important thing which this
patch has not addressed properly is the parallel-safety checks as
pointed by me earlier. There are two things to solve there (a) the
lower-level code (like heap_* APIs, CommandCounterIncrement, xact.c
APIs, etc.) has checks which don't allow any writes, we need to see
which of those we can open now (or do some additional work to prevent
from those checks) after some of the work done for parallel-writes in
PG-13[1][2], and (b) in which all cases we can parallel-writes
(parallel copy) is allowed, for example need to identify whether table
or one of its partitions has any constraint/expression which is
parallel-unsafe.

Re: Parallel copy

2020-11-01 Thread Heikki Linnakangas

On 02/11/2020 09:10, Heikki Linnakangas wrote:

On 02/11/2020 08:14, Amit Kapila wrote:

We have discussed both these approaches (a) single producer multiple
consumer, and (b) all workers doing the processing as you are saying
in the beginning and concluded that (a) is better, see some of the
relevant emails [1][2][3].

[1] - 
https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
[2] - 
https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
[3] - 
https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de


Sorry I'm late to the party. I don't think the design I proposed was
discussed in that threads. The alternative that's discussed in that
thread seems to be something much more fine-grained, where processes
claim individual lines. I'm not sure though, I didn't fully understand
the alternative designs.


I read the thread more carefully, and I think Robert had basically the 
right idea here 
(https://www.postgresql.org/message-id/CA%2BTgmoZMU4az9MmdJtg04pjRa0wmWQtmoMxttdxNrupYJNcR3w%40mail.gmail.com):



I really think we don't want a single worker in charge of finding
tuple boundaries for everybody. That adds a lot of unnecessary
inter-process communication and synchronization. Each process should
just get the next tuple starting after where the last one ended, and
then advance the end pointer so that the next process can do the same
thing. [...]


And here 
(https://www.postgresql.org/message-id/CA%2BTgmoZw%2BF3y%2BoaxEsHEZBxdL1x1KAJ7pRMNgCqX0WjmjGNLrA%40mail.gmail.com):



On Thu, Apr 9, 2020 at 2:55 PM Andres Freund  wrote:

I'm fairly certain that we do *not* want to distribute input data
between processes on a single tuple basis. Probably not even below a few
hundred kb. If there's any sort of natural clustering in the loaded data
- extremely common, think timestamps - splitting on a granular basis
will make indexing much more expensive. And have a lot more contention.


That's a fair point. I think the solution ought to be that once any
process starts finding line endings, it continues until it's grabbed
at least a certain amount of data for itself. Then it stops and lets
some other process grab a chunk of data.
Yes! That's pretty close to the design I sketched. I imagined that the 
leader would divide the input into 64 kB blocks, and each block would 
have  few metadata fields, notably the starting position of the first 
line in the block. I think Robert envisioned having a single "next 
starting position" field in shared memory. That works too, and is even 
simpler, so +1 for that.
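
To make that concrete, here is a minimal standalone sketch (C11 atomics;
all names are invented for illustration, and real CSV input would also
need quote tracking): a worker loads the shared position, extends its
claim to at least a minimum amount of data ending on a line boundary,
and compare-and-swaps the position forward so the next worker continues
from there:

#include <stdatomic.h>
#include <string.h>

#define MIN_CLAIM 65536                 /* grab at least ~64 kB per claim */

static atomic_size_t next_start;        /* shared "next starting position" */

/* Claim a range of whole lines from buf (length len), starting at the
 * shared position and ending on a line boundary. Returns the claimed
 * length and sets *start_out, or returns 0 when the input is exhausted. */
static size_t
claim_lines(const char *buf, size_t len, size_t *start_out)
{
    for (;;)
    {
        size_t  start = atomic_load(&next_start);
        size_t  end;

        if (start >= len)
            return 0;

        /* take at least MIN_CLAIM bytes, then extend to the end of the
         * current line so no tuple is ever split between workers */
        end = start + MIN_CLAIM;
        if (end >= len)
            end = len;
        else
        {
            const char *nl = memchr(buf + end, '\n', len - end);

            end = nl ? (size_t) (nl - buf) + 1 : len;
        }

        /* advance the shared position; if the CAS fails, another worker
         * won the race and we retry from the new position */
        if (atomic_compare_exchange_weak(&next_start, &start, end))
        {
            *start_out = start;
            return end - start;
        }
    }
}

int
main(void)
{
    const char *input = "a,1\nb,2\nc,3\n"; /* toy stand-in for the input */
    size_t      start;

    while (claim_lines(input, strlen(input), &start) > 0)
    {
        /* parse and insert the lines in the claimed range here */
    }
    return 0;
}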


For some reason, the discussion took a different turn from there, to 
discuss how the line-endings (called "chunks" in the discussion) should 
be represented in shared memory. But none of that is necessary with 
Robert's design.


- Heikki




Re: Parallel copy

2020-11-01 Thread Heikki Linnakangas

On 02/11/2020 08:14, Amit Kapila wrote:

On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas  wrote:


Leader process:

The leader process is simple. It picks the next FREE buffer, fills it
with raw data from the file, and marks it as FILLED. If no buffers are
FREE, wait.

Worker process:

1. Claim next READY block from queue, by changing its state to
 PROCESSING. If the next block is not READY yet, wait until it is.

2. Start scanning the block from 'startpos', finding end-of-line
 markers. (in CSV mode, need to track when we're in-quotes).

3. When you reach the end of the block, if the last line continues to
 next block, wait for the next block to become FILLED. Peek into the
 next block, and copy the remaining part of the split line to a local
 buffer, and set the 'startpos' on the next block to point to the end
 of the split line. Mark the next block as READY.

4. Process all the lines in the block, call input functions, insert
 rows.

5. Mark the block as DONE.

In this design, you don't need to keep line boundaries in shared memory,
because each worker process is responsible for finding the line
boundaries of its own block.
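
Sketching that state machine in code (standalone C with C11 atomics; the
struct layout and helper are invented for illustration, not taken from any
patch): the leader moves blocks FREE -> FILLED, the worker on the previous
block moves FILLED -> READY by publishing 'startpos', and workers claim
blocks READY -> PROCESSING with a compare-and-swap:

#include <stdatomic.h>
#include <stddef.h>

typedef enum BlockState
{
    BLK_FREE,                   /* leader may fill with raw input */
    BLK_FILLED,                 /* raw data present, startpos not yet known */
    BLK_READY,                  /* startpos set by previous block's worker */
    BLK_PROCESSING,             /* claimed by a worker */
    BLK_DONE                    /* all lines inserted; recycled as FREE */
} BlockState;

typedef struct Block
{
    _Atomic BlockState state;
    size_t      startpos;       /* offset of first whole line in data */
    size_t      nbytes;
    char        data[65536];
} Block;

/* Worker step 1 of the design above: claim the next block by atomically
 * moving it READY -> PROCESSING; on failure the caller must wait for the
 * leader (FREE -> FILLED) or for the previous block's worker to publish
 * startpos (FILLED -> READY), then retry. */
static Block *
try_claim_block(Block *ring, int nblocks, int cursor)
{
    Block      *b = &ring[cursor % nblocks];
    BlockState  expected = BLK_READY;

    if (atomic_compare_exchange_strong(&b->state, &expected, BLK_PROCESSING))
        return b;
    return NULL;                /* not ready yet: wait and retry */
}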

There's a point of serialization here, in that the next block cannot be
processed, until the worker working on the previous block has finished
scanning the EOLs, and set the starting position on the next block,
putting it in READY state. That's not very different from your patch,
where you had a similar point of serialization because the leader
scanned the EOLs,


But in the design (single producer multiple consumer) used by the
patch the worker doesn't need to wait till the complete block is
processed, it can start processing the lines already found. This will
also allow workers to start much earlier to process the data as it
doesn't need to wait for all the offsets corresponding to 64K block
ready. However, in the design where each worker is processing the 64K
block, it can lead to much longer waits. I think this will impact the
Copy STDIN case more where in most cases (200-300 bytes tuples) we
receive line-by-line from client and find the line-endings by leader.
If the leader doesn't find the line-endings the workers need to wait
till the leader fill the entire 64K chunk, OTOH, with current approach
the worker can start as soon as leader is able to populate some
minimum number of line-endings


You can use a smaller block size. However, the point of parallel copy is 
to maximize bandwidth. If the workers ever have to sit idle, it means 
that the bottleneck is in receiving data from the client, i.e. the 
backend is fast enough, and you can't make the overall COPY finish any 
faster no matter how you do it.



The other point is that the leader backend won't be used completely as
it is only doing a very small part (primarily reading the file) of the
overall work.


An idle process doesn't cost anything. If you have free CPU resources, 
use more workers.



We have discussed both these approaches (a) single producer multiple
consumer, and (b) all workers doing the processing as you are saying
in the beginning and concluded that (a) is better, see some of the
relevant emails [1][2][3].

[1] - 
https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
[2] - 
https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
[3] - 
https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de


Sorry I'm late to the party. I don't think the design I proposed was 
discussed in that threads. The alternative that's discussed in that 
thread seems to be something much more fine-grained, where processes 
claim individual lines. I'm not sure though, I didn't fully understand 
the alternative designs.


I want to throw out one more idea. It's an interim step, not the final
solution we want, but a useful step in getting there:


Have the leader process scan the input for line-endings. Split the input 
data into blocks of slightly under 64 kB in size, so that a line never 
crosses a block. Put the blocks in shared memory.
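
A small sketch of that splitting rule (standalone C; the constant and the
function are invented for illustration), where each emitted block ends just
after a newline so that no line crosses a block:

#include <stddef.h>

#define BLOCK_LIMIT 65536   /* "slightly under 64 kB": last newline wins */

/* Given the remaining input, return the length of the next block such
 * that it ends just after a newline. Assumes a line always fits in one
 * block; a real implementation must handle longer lines (and CSV quotes). */
static size_t
next_block_len(const char *in, size_t len)
{
    if (len <= BLOCK_LIMIT)
        return len;
    for (size_t i = BLOCK_LIMIT; i > 0; i--)   /* scan back to a newline */
        if (in[i - 1] == '\n')
            return i;
    return 0;   /* line longer than a block: caller must deal with it */
}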


A worker process claims a block from shared memory, processes it from 
beginning to end. It *also* has to parse the input to split it into lines.


In this design, the line-splitting is done twice. That's clearly not 
optimal, and we want to avoid that in the final patch, but I think it 
would be a useful milestone. After that patch is done, write another 
patch to either a) implement the design I sketched, where blocks are 
fixed-size and a worker notifies the next worker on where the first line 
in next block begins, or b) have the leader process report the 
line-ending positions in shared memory, so that workers don't need to 
scan them again.


Even if we apply the patches together, I think splitting them like that 
would make for easier review.


- Heikki




Re: Parallel copy

2020-11-01 Thread Amit Kapila
On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas  wrote:
>
> Leader process:
>
> The leader process is simple. It picks the next FREE buffer, fills it
> with raw data from the file, and marks it as FILLED. If no buffers are
> FREE, wait.
>
> Worker process:
>
> 1. Claim next READY block from queue, by changing its state to
> PROCESSING. If the next block is not READY yet, wait until it is.
>
> 2. Start scanning the block from 'startpos', finding end-of-line
> markers. (in CSV mode, need to track when we're in-quotes).
>
> 3. When you reach the end of the block, if the last line continues to
> next block, wait for the next block to become FILLED. Peek into the
> next block, and copy the remaining part of the split line to a local
> buffer, and set the 'startpos' on the next block to point to the end
> of the split line. Mark the next block as READY.
>
> 4. Process all the lines in the block, call input functions, insert
> rows.
>
> 5. Mark the block as DONE.
>
> In this design, you don't need to keep line boundaries in shared memory,
> because each worker process is responsible for finding the line
> boundaries of its own block.
>
> There's a point of serialization here, in that the next block cannot be
> processed, until the worker working on the previous block has finished
> scanning the EOLs, and set the starting position on the next block,
> putting it in READY state. That's not very different from your patch,
> where you had a similar point of serialization because the leader
> scanned the EOLs,
>

But in the design (single producer multiple consumer) used by the
patch the worker doesn't need to wait till the complete block is
processed; it can start processing the lines already found. This will
also allow workers to start much earlier to process the data as they
don't need to wait for all the offsets corresponding to the 64K block
to be ready. However, in the design where each worker is processing the
64K block, it can lead to much longer waits. I think this will impact the
Copy STDIN case more where in most cases (200-300 byte tuples) we
receive line-by-line from the client and the line-endings are found by
the leader. If the leader doesn't find the line-endings, the workers need
to wait till the leader fills the entire 64K chunk; OTOH, with the current
approach the worker can start as soon as the leader is able to populate
some minimum number of line-endings.

The other point is that the leader backend won't be fully utilized, as
it is only doing a very small part (primarily reading the file) of the
overall work.

We have discussed both these approaches (a) single producer multiple
consumer, and (b) all workers doing the processing as you are saying
in the beginning and concluded that (a) is better, see some of the
relevant emails [1][2][3].

[1] - 
https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
[2] - 
https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
[3] - 
https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-31 Thread Tomas Vondra

On Sat, Oct 31, 2020 at 12:09:32AM +0200, Heikki Linnakangas wrote:

On 30/10/2020 22:56, Tomas Vondra wrote:

I agree this design looks simpler. I'm a bit worried about serializing
the parsing like this, though. It's true the current approach (where the
first phase of parsing happens in the leader) has a similar issue, but I
think it would be easier to improve that in that design.

My plan was to parallelize the parsing roughly like this:

1) split the input buffer into smaller chunks

2) let workers scan the buffers and record positions of interesting
characters (delimiters, quotes, ...) and pass it back to the leader

3) use the information to actually parse the input data (we only need to
look at the interesting characters, skipping large parts of data)

4) pass the parsed chunks to workers, just like in the current patch


But maybe something like that would be possible even with the approach
you propose - we could have a special parse phase for processing each
buffer, where any worker could look for the special characters, record
the positions in a bitmap next to the buffer. So the whole sequence of
states would look something like this:

 EMPTY
 FILLED
 PARSED
 READY
 PROCESSING


I think it's even simpler than that. You don't need to communicate the 
"interesting positions" between processes, if the same worker takes 
care of the chunk through all states from FILLED to DONE.


You can build the bitmap of interesting positions immediately in 
FILLED state, independently of all previous blocks. Once you've built 
the bitmap, you need to wait for the information on where the first 
line starts, but presumably finding the interesting positions is the 
expensive part.




I don't think it's that simple. For example, the previous block may
contain a very long value (say, 1MB), so a bunch of blocks have to be
processed by the same worker. That probably complicates the state
transitions a bit, and it also means the bitmap would need to be passed
to the worker that actually processes the block. Or we might just ignore
this, on the grounds that it's not a very common situation.



Of course, the question is whether parsing really is sufficiently
expensive for this to be worth it.


Yeah, I don't think it's worth it. Splitting the lines is pretty fast, 
I think we have many years to come before that becomes a bottleneck. 
But if it turns out I'm wrong and we need to implement that, the path 
is pretty straightforward.




OK. I agree the parsing is relatively cheap, and I don't recall seeing
CSV parsing as a bottleneck in production.  I suspect that might be
simply because we're hitting other bottlenecks first, though.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Parallel copy

2020-10-30 Thread Heikki Linnakangas

On 30/10/2020 22:56, Tomas Vondra wrote:

I agree this design looks simpler. I'm a bit worried about serializing
the parsing like this, though. It's true the current approach (where the
first phase of parsing happens in the leader) has a similar issue, but I
think it would be easier to improve that in that design.

My plan was to parallelize the parsing roughly like this:

1) split the input buffer into smaller chunks

2) let workers scan the buffers and record positions of interesting
characters (delimiters, quotes, ...) and pass it back to the leader

3) use the information to actually parse the input data (we only need to
look at the interesting characters, skipping large parts of data)

4) pass the parsed chunks to workers, just like in the current patch


But maybe something like that would be possible even with the approach
you propose - we could have a special parse phase for processing each
buffer, where any worker could look for the special characters, record
the positions in a bitmap next to the buffer. So the whole sequence of
states would look something like this:

  EMPTY
  FILLED
  PARSED
  READY
  PROCESSING


I think it's even simpler than that. You don't need to communicate the 
"interesting positions" between processes, if the same worker takes care 
of the chunk through all states from FILLED to DONE.


You can build the bitmap of interesting positions immediately in FILLED 
state, independently of all previous blocks. Once you've built the 
bitmap, you need to wait for the information on where the first line 
starts, but presumably finding the interesting positions is the 
expensive part.



Of course, the question is whether parsing really is sufficiently
expensive for this to be worth it.


Yeah, I don't think it's worth it. Splitting the lines is pretty fast, I 
think we have many years to come before that becomes a bottleneck. But 
if it turns out I'm wrong and we need to implement that, the path is 
pretty straightforward.


- Heikki




Re: Parallel copy

2020-10-30 Thread Tomas Vondra

On Fri, Oct 30, 2020 at 06:41:41PM +0200, Heikki Linnakangas wrote:

On 30/10/2020 18:36, Heikki Linnakangas wrote:

I find this design to be very complicated. Why does the line-boundary
information need to be in shared memory? I think this would be much
simpler if each worker grabbed a fixed-size block of raw data, and
processed that.

In your patch, the leader process scans the input to find out where one
line ends and another begins, and because of that decision, the leader
needs to make the line boundaries available in shared memory, for the
worker processes. If we moved that responsibility to the worker
processes, you wouldn't need to keep the line boundaries in shared
memory. A worker would only need to pass enough state to the next worker
to tell it where to start scanning the next block.


Here's a high-level sketch of how I'm imagining this to work:

The shared memory structure consists of a queue of blocks, arranged as 
a ring buffer. Each block is of fixed size, and contains 64 kB of 
data, and a few fields for coordination:


typedef struct
{
   /* Current state of the block */
   pg_atomic_uint32 state;

   /* starting offset of first line within the block */
   int startpos;

   char        data[64 kB];
} ParallelCopyDataBlock;

Where state is one of:

enum {
 FREE,   /* buffer is empty */
 FILLED, /* leader has filled the buffer with raw data */
 READY,  /* start pos has been filled in, but no worker process 
has claimed the block yet */

 PROCESSING, /* worker has claimed the block, and is processing it */
}

State changes FREE -> FILLED -> READY -> PROCESSING -> FREE. As the 
COPY progresses, the ring of blocks will always look something like 
this:


blk 0 startpos  0: PROCESSING [worker 1]
blk 1 startpos 12: PROCESSING [worker 2]
blk 2 startpos 10: READY
blk 3 startpos  -: FILLED
blk 4 startpos  -: FILLED
blk 5 startpos  -: FILLED
blk 6 startpos  -: FREE
blk 7 startpos  -: FREE

Typically, each worker process is busy processing a block. After the 
blocks being processed, there is one block in READY state, and after 
that, blocks in FILLED state.


Leader process:

The leader process is simple. It picks the next FREE buffer, fills it 
with raw data from the file, and marks it as FILLED. If no buffers are 
FREE, wait.


Worker process:

1. Claim next READY block from queue, by changing its state to
  PROCESSING. If the next block is not READY yet, wait until it is.

2. Start scanning the block from 'startpos', finding end-of-line
  markers. (in CSV mode, need to track when we're in-quotes).

3. When you reach the end of the block, if the last line continues to
  next block, wait for the next block to become FILLED. Peek into the
  next block, and copy the remaining part of the split line to a local
  buffer, and set the 'startpos' on the next block to point to the end
  of the split line. Mark the next block as READY.

4. Process all the lines in the block, call input functions, insert
  rows.

5. Mark the block as DONE.

In this design, you don't need to keep line boundaries in shared 
memory, because each worker process is responsible for finding the 
line boundaries of its own block.


There's a point of serialization here, in that the next block cannot 
be processed, until the worker working on the previous block has 
finished scanning the EOLs, and set the starting position on the next 
block, putting it in READY state. That's not very different from your 
patch, where you had a similar point of serialization because the 
leader scanned the EOLs, but I think the coordination between 
processes is simpler here.




I agree this design looks simpler. I'm a bit worried about serializing
the parsing like this, though. It's true the current approach (where the
first phase of parsing happens in the leader) has a similar issue, but I
think it would be easier to improve that in that design.

My plan was to parallelize the parsing roughly like this:

1) split the input buffer into smaller chunks

2) let workers scan the buffers and record positions of interesting
characters (delimiters, quotes, ...) and pass it back to the leader

3) use the information to actually parse the input data (we only need to
look at the interesting characters, skipping large parts of data)

4) pass the parsed chunks to workers, just like in the current patch


But maybe something like that would be possible even with the approach
you propose - we could have a special parse phase for processing each
buffer, where any worker could look for the special characters, record
the positions in a bitmap next to the buffer. So the whole sequence of
states would look something like this:

EMPTY
FILLED
PARSED
READY
PROCESSING

Of course, the question is whether parsing really is sufficiently
expensive for this to be worth it.
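For concreteness, the scanning pass of step 2 could look roughly like 
this, assuming 64 kB buffers and one bit per input byte (names invented; 
which characters count as "interesting" depends on the COPY format):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE (64 * 1024)

typedef struct
{
	uint64_t	bits[BUF_SIZE / 64];	/* bit set => parser must look here */
} PosBitmap;

/*
 * Mark every byte the parser will have to inspect: delimiters, quotes,
 * escapes and line ends.  The parse phase can then skip runs of zero
 * bits without examining the data again.
 */
static void
mark_interesting(const char *buf, size_t len, char delim, char quote,
				 char escape, PosBitmap *bm)
{
	size_t		i;

	memset(bm, 0, sizeof(*bm));
	for (i = 0; i < len; i++)
	{
		char		c = buf[i];

		if (c == delim || c == quote || c == escape || c == '\n' || c == '\r')
			bm->bits[i / 64] |= UINT64_C(1) << (i % 64);
	}
}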


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Parallel copy

2020-10-30 Thread Tomas Vondra

Hi,

I've done a bit more testing today, and I think the parsing is busted in
some way. Consider this:

test=# create extension random;
CREATE EXTENSION

test=# create table t (a text);

CREATE TABLE

test=# insert into t select random_string(random_int(10, 256*1024)) from generate_series(1,1);

INSERT 0 1

test=# copy t to '/mnt/data/t.csv';

COPY 1

test=# truncate t;

TRUNCATE TABLE

test=# copy t from '/mnt/data/t.csv';

COPY 1

test=# truncate t;

TRUNCATE TABLE

test=# copy t from '/mnt/data/t.csv' with (parallel 2);

ERROR:  invalid byte sequence for encoding "UTF8": 0x00
CONTEXT:  COPY t, line 485: 
"m&\nh%_a"%r]>qtCl:Q5ltvF~;2oS6@HB>F>og,bD$Lw'nZY\tYl#BH\t{(j~ryoZ08"SGU~.}8CcTRk1\ts$@U3szCC+U1U3i@P..."
parallel worker


The functions come from an extension I use to generate random data, I've
pushed it to github [1]. The random_string() generates a random string
with ASCII characters, symbols and a couple special characters (\r\n\t).
The intent was to try loading data where a fields may span multiple 64kB
blocks and may contain newlines etc.

The non-parallel copy works fine, the parallel one fails. I haven't
investigated the details, but I guess it gets confused about where a
string starts/end, or something like that.


[1] https://github.com/tvondra/random


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Parallel copy

2020-10-30 Thread Heikki Linnakangas

On 30/10/2020 18:36, Heikki Linnakangas wrote:

Whether the leader process finds the EOLs or the worker processes, it's
pretty clear that it needs to be done ASAP, for a chunk at a time,
because that cannot be done in parallel. I think some refactoring in
CopyReadLine() and friends would be in order. It probably would be
faster, or at least not slower, to find all the EOLs in a block in one
tight loop, even when parallel copy is not used.


Something like the attached. It passes the regression tests, but it's 
quite incomplete. It's missing handling of "\." as the end-of-file marker, 
and I haven't tested encoding conversions at all, for starters. Quick 
testing suggests that this is a little bit faster than the current code, 
but the difference is small; I had to use a "WHERE false" option to 
really see the difference.


The crucial thing here is that there's a new function, ParseLinesText(), 
to find all end-of-line characters in a buffer in one go. In this patch, 
it's used against 'raw_buf', but with parallel copy, you could point it 
at a block in shared memory instead.
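To give an idea of the shape of that loop (a simplification, not the 
attached patch's actual code), the scan can be a single pass over the 
buffer driven by the ParseLinesState states introduced below, assuming 
'\n' terminators, '"' quoting and backslash escapes:

static int
find_line_ends(const char *buf, int len, int *endlines, int max_ends,
			   ParseLinesState *state)
{
	int			nlines = 0;
	int			i;

	for (i = 0; i < len && nlines < max_ends; i++)
	{
		char		c = buf[i];

		switch (*state)
		{
			case PLSTATE_NORMAL:
				if (c == '\n')
					endlines[nlines++] = i + 1;	/* offset just past EOL */
				else if (c == '"')
					*state = PLSTATE_IN_QUOTE;
				else if (c == '\\')
					*state = PLSTATE_ESCAPE;
				break;
			case PLSTATE_ESCAPE:
				*state = PLSTATE_NORMAL;	/* consume one escaped byte */
				break;
			case PLSTATE_IN_QUOTE:
				if (c == '"')
					*state = PLSTATE_NORMAL;
				else if (c == '\\')
					*state = PLSTATE_ESCAPE_IN_QUOTE;
				break;
			case PLSTATE_ESCAPE_IN_QUOTE:
				*state = PLSTATE_IN_QUOTE;
				break;
		}
	}
	return nlines;
}

Since the state is carried across calls, the same loop keeps working when 
a line straddles two raw buffers.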


- Heikki
>From af3be3bd4e77b66f4605393617da0d15ec21e15b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Fri, 30 Oct 2020 18:51:10 +0200
Subject: [PATCH 1/1] WIP: Find all line-endings in COPY in chunks.

Refactor CopyReadLines and friends to find all the line-endings in the
buffer in one go, before splitting the lines further.
---
 src/backend/commands/copy.c | 972 
 1 file changed, 536 insertions(+), 436 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 36ddcdccdb8..fbf11cb2550 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -95,6 +95,18 @@ typedef enum CopyInsertMethod
 	CIM_MULTI_CONDITIONAL		/* use table_multi_insert only if valid */
 } CopyInsertMethod;
 
+
+/*
+ * Represents the state of the line-ending scanner used during COPY FROM.
+ */
+typedef enum ParseLinesState
+{
+	PLSTATE_NORMAL,
+	PLSTATE_ESCAPE,
+	PLSTATE_IN_QUOTE,
+	PLSTATE_ESCAPE_IN_QUOTE
+} ParseLinesState;
+
 /*
  * This struct contains all the state variables used throughout a COPY
  * operation. For simplicity, we use the same struct for all variants of COPY,
@@ -110,6 +122,24 @@ typedef enum CopyInsertMethod
  * it's faster to make useless comparisons to trailing bytes than it is to
  * invoke pg_encoding_mblen() to skip over them. encoding_embeds_ascii is true
  * when we have to do it the hard way.
+ *
+ * COPY FROM buffers:
+ *
+ * In COPY FROM processing, there are three levels of buffers:
+ *
+ * raw_buf   - contains raw data read from file/client
+ * converted_buf - contains the data in 'raw_buf', but converted to server encoding
+ * line_buf  - contains "current" line of data, without the end-of-line char
+ *
+ *
+ * In simple cases, no encoding conversion is needed, and converted_buf always
+ * points to raw_buf. If encoding_embeds_ascii is true, encoding conversion is
+ * performed on the raw buffer, before splitting it into lines. converted_buf
+ * contains the converted version in that case.
+ *
+ * Usually, the line_buf pointer points into the middle of converted_buf, but when a line
+ * is split by a raw-buffer boundary, the incomplete line is reassembled
+ * in a separate buffer (split_line_buf), and line_buf points to that.
  */
 typedef struct CopyStateData
 {
@@ -205,16 +235,34 @@ typedef struct CopyStateData
 	char	  **raw_fields;
 
 	/*
-	 * Similarly, line_buf holds the whole input line being processed. The
+	 * These variables are used to track state of parsing raw data into
+	 * lines in COPY FROM.
+	 */
+	bool		last_was_cr;
+	ParseLinesState parse_lines_state;
+
+	int			last_line_no; /* last line in 'endlines', -1 if EOF not reached yet */
+
+	int			nextline;
+	int		   *endlines; /* line ending positions within raw_buf */
+	int			numlines;
+
+	/* split_line_buf holds partial line carried over from previous buf */
+	StringInfoData split_line_buf;
+
+	/*
+	 * Similarly, line_buf holds the current input line being processed. The
 	 * input cycle is first to read the whole line into line_buf, convert it
 	 * to server encoding there, and then extract the individual attribute
 	 * fields into attribute_buf.  line_buf is preserved unmodified so that we
 	 * can display it in error messages if appropriate.  (In binary mode,
 	 * line_buf is not used.)
 	 */
-	StringInfoData line_buf;
+	char	   *line_buf;
+	int			line_len;
 	bool		line_buf_converted; /* converted to server encoding? */
 	bool		line_buf_valid; /* contains the row being processed? */
+	bool		line_buf_alloced;
 
 	/*
 	 * Finally, raw_buf holds raw data read from the data source (file or
@@ -230,6 +278,9 @@ typedef struct CopyStateData
 	int			raw_buf_len;	/* total # of bytes stored */
 	/* Shorthand for number of unconsumed bytes available in raw_buf */
#define RAW_BUF_BYTES(cstate) ((cstate)->raw_buf_len - (cstate)->raw_buf_index)

Re: Parallel copy

2020-10-30 Thread Heikki Linnakangas

On 30/10/2020 18:36, Heikki Linnakangas wrote:

I find this design to be very complicated. Why does the line-boundary
information need to be in shared memory? I think this would be much
simpler if each worker grabbed a fixed-size block of raw data, and
processed that.

In your patch, the leader process scans the input to find out where one
line ends and another begins, and because of that decision, the leader
needs to make the line boundaries available in shared memory, for the
worker processes. If we moved that responsibility to the worker
processes, you wouldn't need to keep the line boundaries in shared
memory. A worker would only need to pass enough state to the next worker
to tell it where to start scanning the next block.


Here's a high-level sketch of how I'm imagining this to work:

The shared memory structure consists of a queue of blocks, arranged as a 
ring buffer. Each block is of fixed size, and contains 64 kB of data, 
and a few fields for coordination:


typedef struct
{
/* Current state of the block */
pg_atomic_uint32 state;

/* starting offset of first line within the block */
int startpos;

    char        data[64 kB];
} ParallelCopyDataBlock;

Where state is one of:

enum {
  FREE,   /* buffer is empty */
  FILLED, /* leader has filled the buffer with raw data */
  READY,  /* start pos has been filled in, but no worker process 
has claimed the block yet */

  PROCESSING, /* worker has claimed the block, and is processing it */
}

State changes FREE -> FILLED -> READY -> PROCESSING -> FREE. As the COPY 
progresses, the ring of blocks will always look something like this:


blk 0 startpos  0: PROCESSING [worker 1]
blk 1 startpos 12: PROCESSING [worker 2]
blk 2 startpos 10: READY
blk 3 startpos  -: FILLED
blk 4 startpos  -: FILLED
blk 5 startpos  -: FILLED
blk 6 startpos  -: FREE
blk 7 startpos  -: FREE

Typically, each worker process is busy processing a block. After the 
blocks being processed, there is one block in READY state, and after 
that, blocks in FILLED state.
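If more than one worker can poll the same slot, the READY -> PROCESSING 
transition is a natural fit for compare-and-swap; a minimal sketch using 
PostgreSQL's atomics API and the struct above:

#include "postgres.h"
#include "port/atomics.h"

/* Returns true if this worker won the claim on 'blk'. */
static bool
try_claim_block(ParallelCopyDataBlock *blk)
{
	uint32		expected = READY;

	return pg_atomic_compare_exchange_u32(&blk->state, &expected, PROCESSING);
}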


Leader process:

The leader process is simple. It picks the next FREE buffer, fills it 
with raw data from the file, and marks it as FILLED. If no buffers are 
FREE, wait.


Worker process:

1. Claim next READY block from queue, by changing its state to
   PROCESSING. If the next block is not READY yet, wait until it is.

2. Start scanning the block from 'startpos', finding end-of-line
   markers. (in CSV mode, need to track when we're in-quotes).

3. When you reach the end of the block, if the last line continues to
   next block, wait for the next block to become FILLED. Peek into the
   next block, and copy the remaining part of the split line to a local
   buffer, and set the 'startpos' on the next block to point to the end
   of the split line. Mark the next block as READY.

4. Process all the lines in the block, call input functions, insert
   rows.

5. Mark the block as DONE.

In this design, you don't need to keep line boundaries in shared memory, 
because each worker process is responsible for finding the line 
boundaries of its own block.


There's a point of serialization here, in that the next block cannot be 
processed, until the worker working on the previous block has finished 
scanning the EOLs, and set the starting position on the next block, 
putting it in READY state. That's not very different from your patch, 
where you had a similar point of serialization because the leader 
scanned the EOLs, but I think the coordination between processes is 
simpler here.
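As a rough illustration of the hand-off in step 3, with the waiting shown 
as a busy loop for brevity (real code would sleep on a latch, and would 
need a memory barrier before publishing READY; names are illustrative):

#include "postgres.h"
#include "lib/stringinfo.h"
#include "port/atomics.h"

static void
hand_over_boundary(ParallelCopyDataBlock *next, int split_ofs,
				   StringInfo split_line)
{
	/* wait until the leader has filled the next block */
	while (pg_atomic_read_u32(&next->state) != FILLED)
		;

	/* 'split_ofs' is where the spilled line was found to end in 'next' */
	appendBinaryStringInfo(split_line, next->data, split_ofs);

	/* publish where the first complete line of the next block starts */
	next->startpos = split_ofs;
	pg_atomic_write_u32(&next->state, READY);
}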


- Heikki




Re: Parallel copy

2020-10-30 Thread Heikki Linnakangas

On 27/10/2020 15:36, vignesh C wrote:

Attached v9 patches have the fixes for the above comments.


I find this design to be very complicated. Why does the line-boundary 
information need to be in shared memory? I think this would be much 
simpler if each worker grabbed a fixed-size block of raw data, and 
processed that.


In your patch, the leader process scans the input to find out where one 
line ends and another begins, and because of that decision, the leader 
needs to make the line boundaries available in shared memory, for the 
worker processes. If we moved that responsibility to the worker 
processes, you wouldn't need to keep the line boundaries in shared 
memory. A worker would only need to pass enough state to the next worker 
to tell it where to start scanning the next block.


Whether the leader process finds the EOLs or the worker processes, it's 
pretty clear that it needs to be done ASAP, for a chunk at a time, 
because that cannot be done in parallel. I think some refactoring in 
CopyReadLine() and friends would be in order. It probably would be 
faster, or at least not slower, to find all the EOLs in a block in one 
tight loop, even when parallel copy is not used.


- Heikki




Re: Parallel copy

2020-10-29 Thread Amit Kapila
On Thu, Oct 29, 2020 at 11:45 AM Amit Kapila  wrote:
>
> On Tue, Oct 27, 2020 at 7:06 PM vignesh C  wrote:
> >
> [latest version]
>
> I think the parallel-safety checks in this patch
> (v9-0002-Allow-copy-from-command-to-process-data-from-file) are
> incomplete and wrong.
>

One more point: I noticed that some time back [1] I had given a
suggestion related to the way workers process the set of lines (aka
chunk). I think you can try increasing the chunk size to, say, 100, 500,
or 1000, and use some shared counter to remember the number of chunks
processed.
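The shared counter itself can be a single atomic in the shared state,
e.g. (field name invented):

#include "port/atomics.h"

/* Each call hands out the next chunk number with one atomic add. */
static inline uint64
claim_next_chunk(pg_atomic_uint64 *nchunks_claimed)
{
	return pg_atomic_fetch_add_u64(nchunks_claimed, 1);
}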

[1] - 
https://www.postgresql.org/message-id/CAA4eK1L-Xgw1zZEbGePmhBBWmEmLFL6rCaiOMDPnq2GNMVz-sg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-29 Thread Daniel Westermann (DWE)
On 27/10/2020 15:36, vignesh C wrote:
>> Attached v9 patches have the fixes for the above comments.

>I did some testing:

I did some testing as well and have a cosmetic remark:

postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 10);
ERROR:  value 10 out of bounds for option "parallel"
DETAIL:  Valid values are between "1" and "1024".
postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 1000);
ERROR:  parallel requires an integer value
postgres=# 

Wouldn't it make more sense to only have one error message? The first one seems 
to be the better message.

Regards
Daniel



Re: Parallel copy

2020-10-29 Thread Heikki Linnakangas

On 27/10/2020 15:36, vignesh C wrote:

Attached v9 patches have the fixes for the above comments.


I did some testing:

/tmp/longdata.pl:

#!/usr/bin/perl
#
# Generate three rows:
# foo
# longdatalongdatalongdata...
# bar
#
# The length of the middle row is given as command line arg.
#

my $bytes = $ARGV[0];

print "foo\n";
for (my $i = 0; $i < $bytes; $i += 8) {
    print "longdata";
}
print "\n";
print "bar\n";


postgres=# copy longdata from program 'perl /tmp/longdata.pl 1' 
with (parallel 2);


This gets stuck forever (or at least I didn't have the patience to wait 
it finish). Both worker processes are consuming 100% of CPU.


- Heikki




Re: Parallel copy

2020-10-29 Thread Amit Kapila
On Tue, Oct 27, 2020 at 7:06 PM vignesh C  wrote:
>
[latest version]

I think the parallel-safety checks in this patch
(v9-0002-Allow-copy-from-command-to-process-data-from-file) are
incomplete and wrong. See below comments.
1.
+static pg_attribute_always_inline bool
+CheckExprParallelSafety(CopyState cstate)
+{
+ if (contain_volatile_functions(cstate->whereClause))
+ {
+ if (max_parallel_hazard((Query *) cstate->whereClause) != PROPARALLEL_SAFE)
+ return false;
+ }

I don't understand the above check. Why do we only need to check the
where clause for parallel-safety when it contains volatile functions? It
should be checked otherwise as well, no? A similar comment applies
to other checks in this function. Also, I don't think there is a need
to make this function inline.
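In other words, something along these lines (a sketch only, mirroring the
types used in the quoted patch):

static bool
CheckExprParallelSafety(CopyState cstate)
{
	/* check the WHERE clause unconditionally, not only when volatile */
	if (cstate->whereClause != NULL &&
		max_parallel_hazard((Query *) cstate->whereClause) != PROPARALLEL_SAFE)
		return false;

	/* ... and apply the same unconditional treatment to the other checks ... */
	return true;
}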

2.
+/*
+ * IsParallelCopyAllowed
+ *
+ * Check if parallel copy can be allowed.
+ */
+bool
+IsParallelCopyAllowed(CopyState cstate)
{
..
+ * When there are BEFORE/AFTER/INSTEAD OF row triggers on the table. We do
+ * not allow parallelism in such cases because such triggers might query
+ * the table we are inserting into and act differently if the tuples that
+ * have already been processed and prepared for insertion are not there.
+ * Now, if we allow parallelism with such triggers the behaviour would
+ * depend on if the parallel worker has already inserted or not that
+ * particular tuples.
+ */
+ if (cstate->rel->trigdesc != NULL &&
+ (cstate->rel->trigdesc->trig_insert_after_statement ||
+ cstate->rel->trigdesc->trig_insert_new_table ||
+ cstate->rel->trigdesc->trig_insert_before_row ||
+ cstate->rel->trigdesc->trig_insert_after_row ||
+ cstate->rel->trigdesc->trig_insert_instead_row))
+ return false;
..

Why do we need to disable parallelism for before/after row triggers
unless they have parallel-unsafe functions? I see that a few lines down in
this function you are checking the parallel-safety of trigger functions;
what is the use of that if you are already disabling parallelism
with the above check?

3. What if an index on the table has expressions that are
parallel-unsafe? What is your strategy to check parallel-safety for
partitioned tables?

I suggest checking Greg's patch for parallel-safety of Inserts [1]. I
think you will find that most of those checks are required here as
well and see how we can use that patch (at least what is common). I
feel the first patch should be just to have parallel-safety checks and
we can test that by trying to enable Copy with force_parallel_mode. We
can build the rest of the patch atop of it or in other words, let's
move all parallel-safety work into a separate patch.

Few assorted comments:

1.
+/*
+ * ESTIMATE_NODE_SIZE - Estimate the size required for a node type in shared
+ * memory.
+ */
+#define ESTIMATE_NODE_SIZE(list, listStr, strsize) \
+{ \
+ uint32 estsize = sizeof(uint32); \
+ if ((List *)list != NIL) \
+ { \
+ listStr = nodeToString(list); \
+ estsize += strlen(listStr) + 1; \
+ } \
+ \
+ strsize = add_size(strsize, estsize); \
+}

This can be probably a function instead of a macro.

2.
+/*
+ * ESTIMATE_1BYTE_STR_SIZE - Estimate the size required for 1Byte strings in
+ * shared memory.
+ */
+#define ESTIMATE_1BYTE_STR_SIZE(src, strsize) \
+{ \
+ strsize = add_size(strsize, sizeof(uint8)); \
+ strsize = add_size(strsize, (src) ? 1 : 0); \
+}

This could be an inline function.

3.
+/*
+ * SERIALIZE_1BYTE_STR - Copy 1Byte strings to shared memory.
+ */
+#define SERIALIZE_1BYTE_STR(dest, src, copiedsize) \
+{ \
+ uint8 len = (src) ? 1 : 0; \
+ memcpy(dest + copiedsize, (uint8 *) &len, sizeof(uint8)); \
+ copiedsize += sizeof(uint8); \
+ if (src) \
+ dest[copiedsize++] = src[0]; \
+}

Similarly, this could be a function. I think keeping such things as
macros in between the code makes it difficult to read. Please see if you
can make these and similar macros into functions, unless they are doing
only a few memory instructions. Having functions makes it easier to debug the
code as well.
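For example, the 1-byte-string estimator above could become (sketch):

#include "storage/shmem.h"	/* for add_size() */

static inline void
estimate_1byte_str_size(const char *src, Size *strsize)
{
	*strsize = add_size(*strsize, sizeof(uint8));
	*strsize = add_size(*strsize, src ? 1 : 0);
}

which is type-checked and steppable in a debugger, unlike the macro.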

[1] - 
https://www.postgresql.org/message-id/CAJcOf-cgfjj0NfYPrNFGmQJxsnNW102LTXbzqxQJuziar1EKfQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




RE: Parallel copy

2020-10-28 Thread Hou, Zhijie
Hi 

I found some issue in v9-0002

1.
+
+   elog(DEBUG1, "[Worker] Processing - line position:%d, block:%d, unprocessed lines:%d, offset:%d, line size:%d",
+        write_pos, lineInfo->first_block,
+        pg_atomic_read_u32(&data_blk_ptr->unprocessed_line_parts),
+        offset, pg_atomic_read_u32(&lineInfo->line_size));
+

write_pos and the other variables printed here are of type uint32; I think 
it's better to use '%u' in the elog message.
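That is, something like this (assuming the surrounding variables are as
in the patch):

	elog(DEBUG1, "[Worker] Processing - line position:%u, block:%u, unprocessed lines:%u, offset:%u, line size:%u",
		 write_pos, lineInfo->first_block,
		 pg_atomic_read_u32(&data_blk_ptr->unprocessed_line_parts),
		 offset, pg_atomic_read_u32(&lineInfo->line_size));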

2.
+* line_size will be set. Read the line_size again to be sure 
if it is
+* completed or partial block.
+*/
+   dataSize = pg_atomic_read_u32(&lineInfo->line_size);
+   if (dataSize)

It uses dataSize (type int) to receive a uint32 value, which seems a little
dangerous. Would it be better to define dataSize as uint32 here?

3.
Since functions with 'Cstate' in the name have been changed to 'CState',
I think we can rename PopulateCommonCstateInfo as well.

4.
+   if (pcdata->worker_line_buf_count)

I think some check like the above can be 'if (xxx > 0)', which seems easier to 
understand.


Best regards,
houzj




Re: Parallel copy

2020-10-27 Thread vignesh C
On Fri, Oct 23, 2020 at 6:58 PM Ashutosh Sharma  wrote:
>
> >
> > I think, if possible, all these if-else checks in CopyFrom() can be
> > moved to a single function which can probably be named as
> > IdentifyCopyInsertMethod() and this function can be called in
> > IsParallelCopyAllowed(). This will ensure that in case of Parallel
> > Copy when the leader has performed all these checks, the worker won't
> > do it again. I also feel that it will make the code look a bit
> > cleaner.
> >
>
> Just rewriting above comment to make it a bit more clear:
>
> I think, if possible, all these if-else checks in CopyFrom() should be
> moved to a separate function which can probably be named as
> IdentifyCopyInsertMethod() and this function called from
> IsParallelCopyAllowed() and CopyFrom() functions. It will only be
> called from CopyFrom() when IsParallelCopy() returns false. This will
> ensure that in case of Parallel Copy if the leader has performed all
> these checks, the worker won't do it again. I also feel that having a
> separate function containing all these checks will make the code look
> a bit cleaner.
>

In the recently posted patch we have changed it to simplify the check
for parallel copy, so it is not an exact match. I feel this comment is
not applicable to the latest patch.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-27 Thread vignesh C
Thanks Ashutosh for reviewing and providing your comments.

On Fri, Oct 23, 2020 at 5:43 PM Ashutosh Sharma  wrote:
>
> Hi Vignesh,
>
> Thanks for the updated patches. Here are some more comments that I can
> find after reviewing your latest patches:
>
> +/*
> + * This structure helps in storing the common data from CopyStateData that 
> are
> + * required by the workers. This information will then be allocated and 
> stored
> + * into the DSM for the worker to retrieve and copy it to CopyStateData.
> + */
> +typedef struct SerializedParallelCopyState
> +{
> +   /* low-level state data */
> +   CopyDestcopy_dest;  /* type of copy source/destination */
> +   int file_encoding;  /* file or remote side's character encoding */
> +   boolneed_transcoding;   /* file encoding diff from server? */
> +   boolencoding_embeds_ascii;  /* ASCII can be non-first byte? */
> +
> ...
> ...
> +
> +   /* Working state for COPY FROM */
> +   AttrNumber  num_defaults;
> +   Oid relid;
> +} SerializedParallelCopyState;
>
> Can the above structure not be part of the CopyStateData structure? I
> am just asking this question because all the fields present in the
> above structure are also present in the CopyStateData structure. So,
> including it in the CopyStateData structure will reduce the code
> duplication and will also make CopyStateData a bit shorter.
>

I have removed the common members from the structure, now there are no
common members between CopyStateData & the new structure. I'm using
CopyStateData to copy to/from directly in the new patch.

> --
>
> +   pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist,
> +relid);
>
> Do we need to pass cstate->nworkers and relid to BeginParallelCopy()
> function when we are already passing cstate structure, using which
> both of these information can be retrieved ?
>

nworkers need not be passed as you have suggested, but relid needs to
be passed as we will be setting it in pcdata; modified nworkers as
suggested.

> --
>
> +/* DSM keys for parallel copy.  */
> +#define PARALLEL_COPY_KEY_SHARED_INFO  1
> +#define PARALLEL_COPY_KEY_CSTATE   2
> +#define PARALLEL_COPY_WAL_USAGE3
> +#define PARALLEL_COPY_BUFFER_USAGE 4
>
> DSM key names do not appear to be consistent. For shared info and
> cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY",
> but for WalUsage and BufferUsage structures, it is prefixed with
> "PARALLEL_COPY". I think it would be better to make them consistent.
>

Modified as suggested

> --
>
> if (resultRelInfo->ri_TrigDesc != NULL &&
> (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
>  resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
> {
> /*
>  * Can't support multi-inserts when there are any BEFORE/INSTEAD OF
>  * triggers on the table. Such triggers might query the table we're
>  * inserting into and act differently if the tuples that have already
>  * been processed and prepared for insertion are not there.
>  */
> insertMethod = CIM_SINGLE;
> }
> else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
>  resultRelInfo->ri_TrigDesc->trig_insert_new_table)
> {
> /*
>  * For partitioned tables we can't support multi-inserts when there
>  * are any statement level insert triggers. It might be possible to
>  * allow partitioned tables with such triggers in the future, but for
>  * now, CopyMultiInsertInfoFlush expects that any before row insert
>  * and statement level insert triggers are on the same relation.
>  */
> insertMethod = CIM_SINGLE;
> }
> else if (resultRelInfo->ri_FdwRoutine != NULL ||
>  cstate->volatile_defexprs)
> {
> ...
> ...
>
> I think, if possible, all these if-else checks in CopyFrom() can be
> moved to a single function which can probably be named as
> IdentifyCopyInsertMethod() and this function can be called in
> IsParallelCopyAllowed(). This will ensure that in case of Parallel
> Copy when the leader has performed all these checks, the worker won't
> do it again. I also feel that it will make the code look a bit
> cleaner.
>

In the recently posted patch we have changed it to simplify the check
for parallel copy, so it is not an exact match. I feel this comment is
not applicable to the latest patch.

> --
>
> +void
> +ParallelCopyMain(dsm_segment *seg, shm_toc *toc)
> +{

Re: Parallel copy

2020-10-27 Thread vignesh C
On Wed, Oct 21, 2020 at 3:50 PM Amit Kapila  wrote:
>
> On Wed, Oct 21, 2020 at 3:19 PM Bharath Rupireddy
>  wrote:
> >
> >
> > 9. Instead of calling CopyStringToSharedMemory() for each string
> > variable, can't we just create a linked list of all the strings that
> > need to be copied into shm and call CopyStringToSharedMemory() only
> > once? We could avoid 5 function calls?
> >
>
> If we want to avoid different function calls then can't we just store
> all these strings in a local structure and use it? That might improve
> the other parts of code as well where we are using these as individual
> parameters.
>

I have made one structure, SerializedListToStrCState, to store all the
variables. The rest of the common variables are directly copied from &
into cstate.

> > 10. Similar to above comment: can we fill all the required
> > cstate->variables inside the function CopyNodeFromSharedMemory() and
> > call it only once? In each worker we could save overhead of 5 function
> > calls.
> >
>
> Yeah, that makes sense.
>

I feel keeping it this way makes the code more readable, and also this
is not in a performance-intensive tight loop. I'm retaining the
change as is unless we feel this will make an impact.

This is addressed in v9 patch shared at [1].
[1] - 
https://www.postgresql.org/message-id/caldanm1caonkfdn6k72dsirpgqngvwxql7tjeihz58opnp9...@mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-27 Thread vignesh C
Thanks Heikki for reviewing and providing your comments. Please find
my thoughts below.

On Fri, Oct 23, 2020 at 2:01 PM Heikki Linnakangas  wrote:
>
> I had a brief look at at this patch. Important work! A couple of first
> impressions:
>
> 1. The split between patches
> 0002-Framework-for-leader-worker-in-parallel-copy.patch and
> 0003-Allow-copy-from-command-to-process-data-from-file.patch is quite
> artificial. All the stuff introduced in the first is unused until the
> second patch is applied. The first patch introduces a forward
> declaration for ParallelCopyData(), but the function only comes in the
> second patch. The comments in the first patch talk about
> LINE_LEADER_POPULATING and LINE_LEADER_POPULATED, but the enum only
> comes in the second patch. I think these have to merged into one. If you
> want to split it somehow, I'd suggest having a separate patch just to
> move CopyStateData from copy.c to copy.h. The subsequent patch would
> then be easier to read as you could see more easily what's being added
> to CopyStateData. Actually I think it would be better to have a new
> header file, copy_internal.h, to hold CopyStateData and the other
> structs, and keep copy.h as it is.
>

I have merged the 0002 & 0003 patches, and I have moved a few things,
like the creation of copy_internal.h and the move of CopyStateData from
copy.c into copy_internal.h, into the 0001 patch.

> 2. This desperately needs some kind of a high-level overview of how it
> works. What is a leader, what is a worker? Which process does each step
> of COPY processing, like reading from the file/socket, splitting the
> input into lines, handling escapes, calling input functions, and
> updating the heap and indexes? What data structures are used for the
> communication? How is the work synchronized between the processes?
> There are comments on those individual aspects scattered in the patch,
> but if you're not already familiar with it, you don't know where to
> start. There's some of that in the commit message, but it needs to be
> somewhere in the source code, maybe in a long comment at the top of
> copyparallel.c.
>

Added it in copyparallel.c

> 3. I'm surprised there's a separate ParallelCopyLineBoundary struct for
> every input line. Doesn't that incur a lot of synchronization overhead?
> I haven't done any testing, this is just my gut feeling, but I assumed
> you'd work in batches of, say, 100 or 1000 lines each.
>

Data read from the file will be stored in the DSM, which is of size 64K *
1024. The leader will parse the input and identify the line boundaries:
which data block a line starts in, the starting offset within that data
block, and the line size; this information is kept in
ParallelCopyLineBoundary (sketched below). Like you said, each worker
processes a batch of lines, WORKER_CHUNK_COUNT (64) lines at a time.
Performance test results run for parallel copy are available at [1].
This is addressed in the v9 patch shared at [2].

[1] 
https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/caldanm1caonkfdn6k72dsirpgqngvwxql7tjeihz58opnp9...@mail.gmail.com
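For readers following along, the per-line bookkeeping described above has
roughly this shape (the field names appear in review snippets elsewhere
in this thread; the other details are guesses):

typedef struct ParallelCopyLineBoundary
{
	uint32		first_block;	/* data block in which the line starts */
	uint32		start_offset;	/* offset of the line within that block */
	pg_atomic_uint32 line_size;	/* set once the whole line has been read */
} ParallelCopyLineBoundary;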

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-27 Thread vignesh C
On Wed, Oct 21, 2020 at 4:20 PM Bharath Rupireddy
 wrote:
>
> On Wed, Oct 21, 2020 at 3:18 PM Bharath Rupireddy
>  wrote:
> >
> > 17. Remove extra lines after #define IsHeaderLine()
> > (cstate->header_line && cstate->cur_lineno == 1) in copy.h
> >
>
>  I missed one comment:
>
>  18. I think we need to treat the number of parallel workers as an
> integer similar to the parallel option in vacuum.
>
> postgres=# copy t1 from stdin with(parallel '1');  <-- we
> should not allow this.
> Enter data to be copied followed by a newline.
>
> postgres=# vacuum (parallel '1') t1;
> ERROR:  parallel requires an integer value
>

I have made the behavior the same as vacuum.
This is addressed in v9 patch shared at [1].
[1] - 
https://www.postgresql.org/message-id/caldanm1caonkfdn6k72dsirpgqngvwxql7tjeihz58opnp9...@mail.gmail.com
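For reference, vacuum-style handling of the option looks roughly like
this (a sketch; the bounds follow the error message reported earlier in
this thread):

#include "commands/defrem.h"	/* for defGetInt32() */

static void
parse_parallel_option(CopyState cstate, DefElem *defel)
{
	/* defGetInt32() itself raises "parallel requires an integer value" */
	int			nworkers = defGetInt32(defel);

	if (nworkers < 1 || nworkers > 1024)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("value %d out of bounds for option \"parallel\"",
						nworkers)));
	cstate->nworkers = nworkers;
}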

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-23 Thread Ashutosh Sharma
On Fri, Oct 23, 2020 at 5:42 PM Ashutosh Sharma  wrote:
>
> Hi Vignesh,
>
> Thanks for the updated patches. Here are some more comments that I can
> find after reviewing your latest patches:
>
> +/*
> + * This structure helps in storing the common data from CopyStateData that 
> are
> + * required by the workers. This information will then be allocated and 
> stored
> + * into the DSM for the worker to retrieve and copy it to CopyStateData.
> + */
> +typedef struct SerializedParallelCopyState
> +{
> +   /* low-level state data */
> +   CopyDestcopy_dest;  /* type of copy source/destination */
> +   int file_encoding;  /* file or remote side's character encoding */
> +   boolneed_transcoding;   /* file encoding diff from server? */
> +   boolencoding_embeds_ascii;  /* ASCII can be non-first byte? */
> +
> ...
> ...
> +
> +   /* Working state for COPY FROM */
> +   AttrNumber  num_defaults;
> +   Oid relid;
> +} SerializedParallelCopyState;
>
> Can the above structure not be part of the CopyStateData structure? I
> am just asking this question because all the fields present in the
> above structure are also present in the CopyStateData structure. So,
> including it in the CopyStateData structure will reduce the code
> duplication and will also make CopyStateData a bit shorter.
>
> --
>
> +   pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist,
> +relid);
>
> Do we need to pass cstate->nworkers and relid to BeginParallelCopy()
> function when we are already passing cstate structure, using which
> both of these information can be retrieved ?
>
> --
>
> +/* DSM keys for parallel copy.  */
> +#define PARALLEL_COPY_KEY_SHARED_INFO  1
> +#define PARALLEL_COPY_KEY_CSTATE   2
> +#define PARALLEL_COPY_WAL_USAGE3
> +#define PARALLEL_COPY_BUFFER_USAGE 4
>
> DSM key names do not appear to be consistent. For shared info and
> cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY",
> but for WalUsage and BufferUsage structures, it is prefixed with
> "PARALLEL_COPY". I think it would be better to make them consistent.
>
> --
>
> if (resultRelInfo->ri_TrigDesc != NULL &&
> (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
>  resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
> {
> /*
>  * Can't support multi-inserts when there are any BEFORE/INSTEAD OF
>  * triggers on the table. Such triggers might query the table we're
>  * inserting into and act differently if the tuples that have already
>  * been processed and prepared for insertion are not there.
>  */
> insertMethod = CIM_SINGLE;
> }
> else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
>  resultRelInfo->ri_TrigDesc->trig_insert_new_table)
> {
> /*
>  * For partitioned tables we can't support multi-inserts when there
>  * are any statement level insert triggers. It might be possible to
>  * allow partitioned tables with such triggers in the future, but for
>  * now, CopyMultiInsertInfoFlush expects that any before row insert
>  * and statement level insert triggers are on the same relation.
>  */
> insertMethod = CIM_SINGLE;
> }
> else if (resultRelInfo->ri_FdwRoutine != NULL ||
>  cstate->volatile_defexprs)
> {
> ...
> ...
>
> I think, if possible, all these if-else checks in CopyFrom() can be
> moved to a single function which can probably be named as
> IdentifyCopyInsertMethod() and this function can be called in
> IsParallelCopyAllowed(). This will ensure that in case of Parallel
> Copy when the leader has performed all these checks, the worker won't
> do it again. I also feel that it will make the code look a bit
> cleaner.
>

Just rewriting above comment to make it a bit more clear:

I think, if possible, all these if-else checks in CopyFrom() should be
moved to a separate function which can probably be named as
IdentifyCopyInsertMethod(), and this function can be called from
IsParallelCopyAllowed() and CopyFrom() functions. It will only be
called from CopyFrom() when IsParallelCopy() returns false. This will
ensure that in case of Parallel Copy if the leader has performed all
these checks, the worker won't do it again. I also feel that having a
separate function containing all these checks will make the code look
a bit cleaner.

> --
>
> +void
> +ParallelCopyMain(dsm_segment *seg, shm_toc *toc)

Re: Parallel copy

2020-10-23 Thread Ashutosh Sharma
Hi Vignesh,

Thanks for the updated patches. Here are some more comments that I can
find after reviewing your latest patches:

+/*
+ * This structure helps in storing the common data from CopyStateData that are
+ * required by the workers. This information will then be allocated and stored
+ * into the DSM for the worker to retrieve and copy it to CopyStateData.
+ */
+typedef struct SerializedParallelCopyState
+{
+   /* low-level state data */
+   CopyDestcopy_dest;  /* type of copy source/destination */
+   int file_encoding;  /* file or remote side's character encoding */
+   boolneed_transcoding;   /* file encoding diff from server? */
+   boolencoding_embeds_ascii;  /* ASCII can be non-first byte? */
+
...
...
+
+   /* Working state for COPY FROM */
+   AttrNumber  num_defaults;
+   Oid relid;
+} SerializedParallelCopyState;

Can the above structure not be part of the CopyStateData structure? I
am just asking this question because all the fields present in the
above structure are also present in the CopyStateData structure. So,
including it in the CopyStateData structure will reduce the code
duplication and will also make CopyStateData a bit shorter.

--

+   pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist,
+relid);

Do we need to pass cstate->nworkers and relid to BeginParallelCopy()
function when we are already passing cstate structure, using which
both of these information can be retrieved ?

--

+/* DSM keys for parallel copy.  */
+#define PARALLEL_COPY_KEY_SHARED_INFO  1
+#define PARALLEL_COPY_KEY_CSTATE   2
+#define PARALLEL_COPY_WAL_USAGE3
+#define PARALLEL_COPY_BUFFER_USAGE 4

DSM key names do not appear to be consistent. For shared info and
cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY",
but for WalUsage and BufferUsage structures, it is prefixed with
"PARALLEL_COPY". I think it would be better to make them consistent.

--

if (resultRelInfo->ri_TrigDesc != NULL &&
(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
 resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
{
/*
 * Can't support multi-inserts when there are any BEFORE/INSTEAD OF
 * triggers on the table. Such triggers might query the table we're
 * inserting into and act differently if the tuples that have already
 * been processed and prepared for insertion are not there.
 */
insertMethod = CIM_SINGLE;
}
else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
 resultRelInfo->ri_TrigDesc->trig_insert_new_table)
{
/*
 * For partitioned tables we can't support multi-inserts when there
 * are any statement level insert triggers. It might be possible to
 * allow partitioned tables with such triggers in the future, but for
 * now, CopyMultiInsertInfoFlush expects that any before row insert
 * and statement level insert triggers are on the same relation.
 */
insertMethod = CIM_SINGLE;
}
else if (resultRelInfo->ri_FdwRoutine != NULL ||
 cstate->volatile_defexprs)
{
...
...

I think, if possible, all these if-else checks in CopyFrom() can be
moved to a single function which can probably be named as
IdentifyCopyInsertMethod() and this function can be called in
IsParallelCopyAllowed(). This will ensure that in case of Parallel
Copy when the leader has performed all these checks, the worker won't
do it again. I also feel that it will make the code look a bit
cleaner.

--

+void
+ParallelCopyMain(dsm_segment *seg, shm_toc *toc)
+{
...
...
+   InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+                         &walusage[ParallelWorkerNumber]);
+
+   MemoryContextSwitchTo(oldcontext);
+   pfree(cstate);
+   return;
+}

It seems like you also need to delete the memory context
(cstate->copycontext) here.

--

+void
+ExecBeforeStmtTrigger(CopyState cstate)
+{
+   EState *estate = CreateExecutorState();
+   ResultRelInfo *resultRelInfo;

This function has a lot of comments which have been copied as-is
from the CopyFrom function. I think it would be good to remove those
comments from here and mention that the code in this function has been
taken from the CopyFrom function; anyone with questions can then refer
to CopyFrom. This will again avoid unnecessary code in the patch.

--

As Heikki rightly pointed out in his previous email, we need some high
level description of how Parallel Copy works somewhere in
copyparallel.c file. For reference, please see how a brief description
about parallel vacuum has been added in the vacuumlazy.c file.

 * Lazy vacuum supports parallel execution with parallel worker processes.  In
 * a parallel vacuum, we perform both index vacuum and index cleanup with
 * parallel worker processes.

Re: Parallel copy

2020-10-23 Thread Heikki Linnakangas
I had a brief look at at this patch. Important work! A couple of first 
impressions:


1. The split between patches 
0002-Framework-for-leader-worker-in-parallel-copy.patch and 
0003-Allow-copy-from-command-to-process-data-from-file.patch is quite 
artificial. All the stuff introduced in the first is unused until the 
second patch is applied. The first patch introduces a forward 
declaration for ParallelCopyData(), but the function only comes in the 
second patch. The comments in the first patch talk about 
LINE_LEADER_POPULATING and LINE_LEADER_POPULATED, but the enum only 
comes in the second patch. I think these have to merged into one. If you 
want to split it somehow, I'd suggest having a separate patch just to 
move CopyStateData from copy.c to copy.h. The subsequent patch would 
then be easier to read as you could see more easily what's being added 
to CopyStateData. Actually I think it would be better to have a new 
header file, copy_internal.h, to hold CopyStateData and the other 
structs, and keep copy.h as it is.


2. This desperately needs some kind of a high-level overview of how it 
works. What is a leader, what is a worker? Which process does each step 
of COPY processing, like reading from the file/socket, splitting the 
input into lines, handling escapes, calling input functions, and 
updating the heap and indexes? What data structures are used for the 
communication? How is the work synchronized between the processes? 
There are comments on those individual aspects scattered in the patch, 
but if you're not already familiar with it, you don't know where to 
start. There's some of that in the commit message, but it needs to be 
somewhere in the source code, maybe in a long comment at the top of 
copyparallel.c.


3. I'm surprised there's a separate ParallelCopyLineBoundary struct for 
every input line. Doesn't that incur a lot of synchronization overhead? 
I haven't done any testing, this is just my gut feeling, but I assumed 
you'd work in batches of, say, 100 or 1000 lines each.


- Heikki




Re: Parallel copy

2020-10-21 Thread Bharath Rupireddy
On Wed, Oct 21, 2020 at 3:18 PM Bharath Rupireddy
 wrote:
>
> 17. Remove extra lines after #define IsHeaderLine()
> (cstate->header_line && cstate->cur_lineno == 1) in copy.h
>

 I missed one comment:

 18. I think we need to treat the number of parallel workers as an
integer similar to the parallel option in vacuum.

postgres=# copy t1 from stdin with(parallel '1');  <-- we
should not allow this.
Enter data to be copied followed by a newline.

postgres=# vacuum (parallel '1') t1;
ERROR:  parallel requires an integer value


With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-21 Thread Amit Kapila
On Wed, Oct 21, 2020 at 3:19 PM Bharath Rupireddy
 wrote:
>
>
> 9. Instead of calling CopyStringToSharedMemory() for each string
> variable, can't we just create a linked list of all the strings that
> need to be copied into shm and call CopyStringToSharedMemory() only
> once? We could avoid 5 function calls?
>

If we want to avoid different function calls then can't we just store
all these strings in a local structure and use it? That might improve
the other parts of code as well where we are using these as individual
parameters.

> 10. Similar to above comment: can we fill all the required
> cstate->variables inside the function CopyNodeFromSharedMemory() and
> call it only once? In each worker we could save overhead of 5 function
> calls.
>

Yeah, that makes sense.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-21 Thread Bharath Rupireddy
Hi Vignesh,

I took a look at the v8 patch set. Here are some comments:

1. PopulateCommonCstateInfo() -- can we use PopulateCommonCStateInfo()
or PopulateCopyStateInfo()? And also EstimateCstateSize() --
EstimateCStateSize(), PopulateCstateCatalogInfo() --
PopulateCStateCatalogInfo()?

2. Instead of mentioning numbers like 1024, 64K, 10240 in the
comments, can we represent them in terms of macros?
/* It can hold 1024 blocks of 64K data in DSM to be processed by the worker. */
#define MAX_BLOCKS_COUNT 1024
/*
 * It can hold up to 10240 records' information for workers to process. RINGSIZE

3. How about
"
Each worker at once will pick the WORKER_CHUNK_COUNT records from the
DSM data blocks and store them in it's local memory.
This is to make workers not contend much while getting record
information from the DSM. Read RINGSIZE comments before
 changing this value.
"
instead of
/*
 * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
 * block to process to avoid lock contention. Read RINGSIZE comments before
 * changing this value.
 */

4.  How about one line gap before and after for comments: "Leader
should operate in the following order:" and "Worker should operate in
the following order:"

5. Can we move RAW_BUF_BYTES macro definition to the beginning of the
copy.h where all the macro are defined?

6. I don't think we need the change in toast_internals.c with the
temporary hack Assert(!(IsParallelWorker() && !currentCommandIdUsed));
in GetCurrentCommandId()

7. I think
/* Can't perform copy in parallel */
if (parallel_workers <= 0)
return NULL;
can be
/* Can't perform copy in parallel */
if (parallel_workers == 0)
return NULL;
as parallel_workers can never be < 0 since we enter BeginParallelCopy
only if cstate->nworkers > 0 and also we are not allowed to have
negative values for max_worker_processes.

8. Do we want to pfree(cstate->pcdata) in case we failed to start any
parallel workers? We would have allocated a good amount of memory for
it.
{
/*
 * Reset nworkers to -1 here. This is useful in cases where user
 * specifies parallel workers, but, no worker is picked up, so go
 * back to non parallel mode value of nworkers.
 */
cstate->nworkers = -1;
*processed = CopyFrom(cstate);/* copy from file to database */
}

9. Instead of calling CopyStringToSharedMemory() for each string
variable, can't we just create a linked list of all the strings that
need to be copied into shm and call CopyStringToSharedMemory() only
once? We could avoid 5 function calls?

10. Similar to above comment: can we fill all the required
cstate->variables inside the function CopyNodeFromSharedMemory() and
call it only once? In each worker we could save overhead of 5 function
calls.

11. Looks like CopyStringFromSharedMemory() and
CopyNodeFromSharedMemory() do almost the same things except
stringToNode() and pfree(destptr). Can we have a generic function
CopyFromSharedMemory() or something else and handle with flag "bool
isnode" to differentiate the two use cases?

12. Can we move below check to the end in IsParallelCopyAllowed()?
/* Check parallel safety of the trigger functions. */
if (cstate->rel->trigdesc != NULL &&
!CheckRelTrigFunParallelSafety(cstate->rel->trigdesc))
return false;

13. CacheLineInfo(): Instead of goto empty_data_line_update; how about
having this directly inside the if block as it's being used only once?

14. GetWorkerLine(): How about avoiding the goto statements and replacing
the common code with a static inline function or a macro?

15. UpdateSharedLineInfo(): Below line is misaligned.
lineInfo->first_block = blk_pos;
lineInfo->start_offset = offset;

16. ParallelCopyFrom(): Do we need CHECK_FOR_INTERRUPTS(); at the
start of  for (;;)?

17. Remove extra lines after #define IsHeaderLine()
(cstate->header_line && cstate->cur_lineno == 1) in copy.h

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-21 Thread vignesh C
On Thu, Oct 8, 2020 at 11:15 AM vignesh C  wrote:
>
> I'm summarizing the pending open points so that I don't miss anything:
> 1) Performance test on latest patch set.

It is tested and the results are shared by Bharath at [1]

> 2) Testing points suggested.

Tests are added as suggested and the details are shared by Bharath at [1]

> 3) Support of parallel copy for COPY_OLD_FE.

It is handled as part of v8 patch shared at [2]

> 4) Worker has to hop through all the processed chunks before getting
> the chunk which it can process.

Open

> 5) Handling of Tomas's comments.

I have fixed and updated the fix details as part of [3]

> 6) Handling of Greg's comments.

I have fixed and updated the fix details as part of [4]

Except for "4) Worker has to hop through all the processed chunks before
getting the chunk which it can process", all open tasks are handled. I will
work on this and provide an update shortly.

[1]
https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/caldanm2ucmcmozcbkl8b7az9oyd9hz+fndczhssiiqj4v-x...@mail.gmail.com
[3]
https://www.postgresql.org/message-id/CALDaNm0_zUa9%2BS%3DpwCz3Yp43SY3r9bnO4v-9ucXUujEE%3D0Sd7g%40mail.gmail.com
[4]
https://www.postgresql.org/message-id/CALDaNm31pGG%2BL9N4HbM0mO4iuceih4mJ5s87jEwOPaFLpmDKyQ%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-10-20 Thread Bharath Rupireddy
On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:
>
> On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila 
wrote:
> >
> > 2. Do we have tests for toast tables? I think if you implement the
> > previous point some existing tests might cover it but I feel we should
> > have at least one or two tests for the same.
> >
> Toast table use case 1: 1 tuples, 9.6GB data, 3 indexes 2 on integer
columns, 1 on text column(not the toast column), csv file, each row is >
1320KB:
> (222.767, 0, 1X), (134.171, 1, 1.66X), (93.749, 2, 2.38X), (93.672, 4,
2.38X), (94.827, 8, 2.35X), (93.766, 16, 2.37X), (98.153, 20, 2.27X),
(122.721, 30, 1.81X)
>
> Toast table use case 2: 10 tuples, 96GB data, 3 indexes 2 on integer
columns, 1 on text column(not the toast column), csv file, each row is >
1320KB:
> (2255.032, 0, 1X), (1358.628, 1, 1.66X), (901.170, 2, 2.5X), (912.743, 4,
2.47X), (988.718, 8, 2.28X), (938.000, 16, 2.4X), (997.556, 20, 2.26X),
(1000.586, 30, 2.25X)
>
> Toast table use case3: 1 tuples, 9.6GB, no indexes, binary file, each
row is > 1320KB:
> (136.983, 0, 1X), (136.418, 1, 1X), (81.896, 2, 1.66X), (62.929, 4,
2.16X), (52.311, 8, 2.6X), (40.032, 16, 3.49X), (44.097, 20, 3.09X),
(62.310, 30, 2.18X)
>
> In the case of a Toast table, we could achieve up to 2.5X for csv files,
and 3.5X for binary files. We are analyzing this point and will post an
update on our findings soon.
>

I analyzed the above point of getting only up to 2.5X performance
improvement for csv files with a toast table having 3 indexes - 2 on integer
columns and 1 on a text column (not the toast column). The reason is that the
workers are fast enough to do the work and end up waiting for the leader to
fill in the data blocks, and in this case the leader is already serving the
workers at its maximum possible speed. Hence most of the time the workers are
waiting, not doing any beneficial work.

Having observed the above point, I tried to make the workers perform more work
to avoid the waiting time. For this, I added a gist index on the toasted text
column. The use case and results are as follows.

Toast table use case4: 1 tuples, 9.6GB, 4 indexes - 2 on integer
columns, 1 on non-toasted text column and 1 gist index on toasted text
column, csv file, each row is  ~ 12.2KB:

(1322.839, 0, 1X), (1261.176, 1, 1.05X), (632.296, 2, 2.09X), (321.941, 4,
4.11X), (181.796, 8, 7.27X), *(105.750, 16, 12.51X)*, (107.099, 20,
12.35X), (123.262, 30, 10.73X)

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-10-19 Thread Amit Kapila
On Sun, Oct 18, 2020 at 7:47 AM Hou, Zhijie  wrote:
>
> Hi Vignesh,
>
> After having a look over the patch,
> I have some suggestions for
> 0003-Allow-copy-from-command-to-process-data-from-file.patch.
>
> 1.
>
> +static uint32
> +EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List 
> *attnamelist,
> +  char **whereClauseStr, char 
> **rangeTableStr,
> +  char **attnameListStr, char 
> **notnullListStr,
> +  char **nullListStr, char **convertListStr)
> +{
> +   uint32  strsize = 
> MAXALIGN(sizeof(SerializedParallelCopyState));
> +
> +   strsize += EstimateStringSize(cstate->null_print);
> +   strsize += EstimateStringSize(cstate->delim);
> +   strsize += EstimateStringSize(cstate->quote);
> +   strsize += EstimateStringSize(cstate->escape);
>
>
> It uses the function EstimateStringSize to get the strlen of null_print, delim,
> quote and escape.
> But the length of null_print seems to already be stored in null_print_len,
> and delim/quote/escape must be 1 byte each, so calling strlen again seems
> unnecessary.
>
> How about  " strsize += sizeof(uint32) + cstate->null_print_len + 1"
>

+1. This seems like a good suggestion but add comments for
delim/quote/escape to indicate that we are considering one-byte for
each. I think this will obviate the need of function
EstimateStringSize. Another thing in this regard is that we normally
use add_size function to compute the size but I don't see that being
used in this and nearby computation. That helps us to detect overflow
of addition if any.

EstimateCstateSize()
{
..
+
+ strsize++;
..
}

Why do we need this additional one-byte increment? Does it make sense
to add a small comment for the same?
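
For illustration, the add_size() style could look roughly like this (a sketch
assuming each string is serialized as a uint32 length word followed by its
bytes; this is not the patch's actual code):

Size strsize = MAXALIGN(sizeof(SerializedParallelCopyState));

/* null_print: length word + bytes (null_print_len + 1 includes the terminator) */
strsize = add_size(strsize, sizeof(uint32));
strsize = add_size(strsize, cstate->null_print_len + 1);

/* delim, quote and escape are one byte each, plus their terminator */
strsize = add_size(strsize, 3 * (sizeof(uint32) + 2));

add_size() ereports on overflow, so a bogus estimate cannot silently wrap.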

> 2.
> +   strsize += EstimateNodeSize(cstate->whereClause, whereClauseStr);
>
> +   copiedsize += CopyStringToSharedMemory(cstate, whereClauseStr,
> +  shmptr + copiedsize);
>
> Some string lengths are counted twice.
> The 'whereClauseStr' has strlen called on it once in EstimateNodeSize and again
> in CopyStringToSharedMemory.
> I don't know whether it's worth refactoring the code to avoid the duplicate
> strlen. What do you think?
>

It doesn't seem worth it to me. We would probably need to use additional
variables to save those lengths. I think it will add more
code/complexity than it will save. See EstimateParamListSpace and
SerializeParamList where we get the typeLen each time; that way the code
looks neat to me, and we aren't going to save much by not following
a similar thing here.

-- 
With Regards,
Amit Kapila.




RE: Parallel copy

2020-10-17 Thread Hou, Zhijie
Hi Vignesh,

After having a look over the patch,
I have some suggestions for 
0003-Allow-copy-from-command-to-process-data-from-file.patch.

1.

+static uint32
+EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist,
+  char **whereClauseStr, char **rangeTableStr,
+  char **attnameListStr, char **notnullListStr,
+  char **nullListStr, char **convertListStr)
+{
+   uint32  strsize = MAXALIGN(sizeof(SerializedParallelCopyState));
+
+   strsize += EstimateStringSize(cstate->null_print);
+   strsize += EstimateStringSize(cstate->delim);
+   strsize += EstimateStringSize(cstate->quote);
+   strsize += EstimateStringSize(cstate->escape);


It uses the function EstimateStringSize to get the strlen of null_print, delim,
quote and escape.
But the length of null_print seems to already be stored in null_print_len,
and delim/quote/escape must be 1 byte each, so calling strlen again seems
unnecessary.

How about  " strsize += sizeof(uint32) + cstate->null_print_len + 1"

2.
+   strsize += EstimateNodeSize(cstate->whereClause, whereClauseStr);

+   copiedsize += CopyStringToSharedMemory(cstate, whereClauseStr,
+  shmptr + copiedsize);

Some string lengths are counted twice.
The 'whereClauseStr' has strlen called on it once in EstimateNodeSize and again
in CopyStringToSharedMemory.
I don't know whether it's worth refactoring the code to avoid the duplicate
strlen. What do you think?

Best regards,
houzj



Re: Parallel copy

2020-10-15 Thread Amit Kapila
On Wed, Oct 14, 2020 at 6:51 PM vignesh C  wrote:
>
> On Fri, Oct 9, 2020 at 11:01 AM Amit Kapila  wrote:
> >
> > I am not able to properly parse the data but If understand the wal
> > data for non-parallel (1116 |   0 |   3587203) and parallel (1119
> > |   6 |   3624405) case doesn't seem to be the same. Is that
> > right? If so, why? Please ensure that no checkpoint happens for both
> > cases.
> >
>
> I have disabled checkpoint, the results with the checkpoint disabled
> are given below:
>                         | wal_records | wal_fpi | wal_bytes
> Sequential Copy         | 1116        | 0       | 3587669
> Parallel Copy(1 worker) | 1116        | 0       | 3587669
> Parallel Copy(4 worker) | 1121        | 0       | 3587668
> I noticed that for 1 worker, wal_records & wal_bytes are the same as for
> sequential copy, but with a higher worker count there is a difference in
> wal_records & wal_bytes. I think the difference should be ok because with
> more than 1 worker the order in which the records are processed differs,
> based on which worker picks which records to process from the input file.
> In the case of sequential copy/1 worker the records are always processed
> in the same order, hence wal_bytes are the same.
>

Are all records of the same size in your test? If so, then why the
order should matter? Also, even the number of wal_records has
increased but wal_bytes are not increased, rather it is one-byte less.
Can we identify what is going on here? I don't intend to say that it
is a problem but we should know the reason clearly.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-14 Thread vignesh C
On Thu, Oct 8, 2020 at 8:43 AM Greg Nancarrow  wrote:
>
> On Thu, Oct 8, 2020 at 5:44 AM vignesh C  wrote:
>
> > Attached v6 patch with the fixes.
> >
>
> Hi Vignesh,
>
> I noticed a couple of issues when scanning the code in the following
patch:
>
> v6-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> In the following code, it will put a junk uint16 value into *destptr
> (and thus may well cause a crash) on a Big Endian architecture
> (Solaris Sparc, s390x, etc.):
> You're storing a (uint16) string length in a uint32 and then pulling
> out the lower two bytes of the uint32 and copying them into the
> location pointed to by destptr.
>
>
> static void
> +CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
> + uint32 *copiedsize)
> +{
> + uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
> +
> + memcpy(destptr, (uint16 *) &len, sizeof(uint16));
> + *copiedsize += sizeof(uint16);
> + if (len)
> + {
> + memcpy(destptr + sizeof(uint16), srcPtr, len);
> + *copiedsize += len;
> + }
> +}
>
> I suggest you change the code to:
>
> uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
> memcpy(destptr, &len, sizeof(uint16));
>
> [I assume string length here can't ever exceed (65535 - 1), right?]
>
> Looking a bit deeper into this, I'm wondering if in fact your
> EstimateStringSize() and EstimateNodeSize() functions should be using
> BUFFERALIGN() for EACH stored string/node (rather than just calling
> shm_toc_estimate_chunk() once at the end, after the length of packed
> strings and nodes has been estimated), to ensure alignment of start of
> each string/node. Other Postgres code appears to be aligning each
> stored chunk using shm_toc_estimate_chunk(). See the definition of
> that macro and its current usages.
>

I'm not handling this; it is similar to how it is handled in other places.

> Then you could safely use:
>
> uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
> *(uint16 *)destptr = len;
> *copiedsize += sizeof(uint16);
> if (len)
> {
> memcpy(destptr + sizeof(uint16), srcPtr, len);
> *copiedsize += len;
> }
>
> and in the CopyStringFromSharedMemory() function, then could safely use:
>
> len = *(uint16 *)srcPtr;
>
> The compiler may be smart enough to optimize-away the memcpy() in this
> case anyway, but there are issues in doing this for architectures that
> take a performance hit for unaligned access, or don't support
> unaligned access.

Changed it to uint32 so that there are no issues in case the length exceeds
65535, and also to avoid problems on Big Endian architectures.

> Also, in CopyFromSharedMemory() functions, you should use palloc()
> instead of palloc0(), as you're filling the entire palloc'd buffer
> anyway, so no need to ask for additional MemSet() of all buffer bytes
> to 0 prior to memcpy().
>

I have changed palloc0 to palloc.

Thanks Greg for reviewing & providing your comments. These changes are
fixed in one of my earlier mail [1] that I sent.
[1]
https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-10-14 Thread vignesh C
On Sat, Oct 3, 2020 at 6:20 AM Tomas Vondra 
wrote:
>
> Hello Vignesh,
>
> I've done some basic benchmarking on the v4 version of the patches (but
> AFAIKC the v5 should perform about the same), and some initial review.
>
> For the benchmarking, I used the lineitem table from TPC-H - for 75GB
> data set, this largest table is about 64GB once loaded, with another
> 54GB in 5 indexes. This is on a server with 32 cores, 64GB of RAM and
> NVME storage.
>
> The COPY duration with varying number of workers (specified using the
> parallel COPY option) looks like this:
>
>    workers    duration
>    -------    --------
>          0        1366
>          1        1255
>          2         704
>          3         526
>          4         434
>          5         385
>          6         347
>          7         322
>          8         327
>
> So this seems to work pretty well - initially we get almost linear
> speedup, then it slows down (likely due to contention for locks, I/O
> etc.). Not bad.

Thanks for testing with different workers & posting the results.

> I've only done a quick review, but overall the patch looks in fairly
> good shape.
>
> 1) I don't quite understand why we need INCREMENTPROCESSED and
> RETURNPROCESSED, considering they just do ++ or return. It just
> obfuscates the code, I think.
>

I have removed the macros.

> 2) I find it somewhat strange that BeginParallelCopy can just decide not
> to do parallel copy after all. Why not make this decision in the
> caller? Or maybe it's fine this way, not sure.
>

I have moved the check IsParallelCopyAllowed to the caller.
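
With that, the caller-side flow reads roughly like this (a simplified
sketch with argument lists elided; the names follow the patch but the
exact control flow here is illustrative):

if (cstate->nworkers > 0 && IsParallelCopyAllowed(cstate))
    pcxt = BeginParallelCopy(cstate, ...);  /* may still return NULL if no workers start */

if (pcxt != NULL)
    ParallelCopyFrom(cstate);               /* leader + workers */
else
    *processed = CopyFrom(cstate);          /* serial fallback */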

> 3) AFAIK we don't modify typedefs.list in patches, so these changes
> should be removed.
>

I have seen typedefs.list being changed in many commits, and it also helps
in running pgindent, so I'm retaining this change.

> 4) IsTriggerFunctionParallelSafe actually checks all triggers, not just
> one, so the comment needs minor rewording.
>

Modified the comments.

Thanks for the comments & sharing the test results Tomas, These changes are
fixed in one of my earlier mail [1] that I sent.

[1]
https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-10-14 Thread vignesh C
> > I have verified it using:
> > postgres=# select * from pg_stat_statements where query like '%copy%';
> > [wide output condensed to the relevant non-zero columns]
> >
> > query = copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv'
> > with(format csv, delimiter ',')
> > calls = 1, total_exec_time = 265.195105, rows = 175000
> > shared_blks_hit = 1916, shared_blks_dirtied = 946, shared_blks_written = 946
> > wal_records = 1116, wal_fpi = 0, wal_bytes = 3587203
> >
> > query = copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv'
> > with(format csv, delimiter ',', parallel '2')
> > calls = 1, total_exec_time = 35668.402482, rows = 175000
> > shared_blks_hit = 3101, shared_blks_read = 36, shared_blks_dirtied = 952,
> > shared_blks_written = 919
> > wal_records = 1119, wal_fpi = 6, wal_bytes = 3624405
> > (2 rows)
> >
>
> I am not able to properly parse the data but If understand the wal
> data for non-parallel (1116 |   0 |   3587203) and parallel (1119
> |   6 |   3624405) case doesn't seem to be the same. Is that
> right? If so, why? Please ensure that no checkpoint happens for both
> cases.
>

I have disabled checkpoint, the results with the checkpoint disabled
are given below:
                        | wal_records | wal_fpi | wal_bytes
Sequential Copy         | 1116        | 0       | 3587669
Parallel Copy(1 worker) | 1116        | 0       | 3587669
Parallel Copy(4 worker) | 1121        | 0       | 3587668
I noticed that for 1 worker, wal_records & wal_bytes are the same as for
sequential copy, but with a higher worker count there is a difference in
wal_records & wal_bytes. I think the difference should be ok because with
more than 1 worker the order in which the records are processed differs,
based on which worker picks which records to process from the input file.
In the case of sequential copy/1 worker the records are always processed
in the same order, hence wal_bytes are the same.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-14 Thread vignesh C
On Fri, Oct 9, 2020 at 10:42 AM Amit Kapila  wrote:
>
> On Thu, Oct 8, 2020 at 12:14 AM vignesh C  wrote:
> >
> > On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila 
wrote:
> > >
> > >
> > > I am convinced by the reason given by Kyotaro-San in that another
> > > thread [1] and performance data shown by Peter that this can't be an
> > > independent improvement and rather in some cases it can do harm. Now,
> > > if you need it for a parallel-copy path then we can change it
> > > specifically to the parallel-copy code path but I don't understand
> > > your reason completely.
> > >
> >
> > Whenever we need data to be populated, we will get a new data block &
> > pass it to CopyGetData to populate the data. In case of file copy, the
> > server will completely fill the data block; a partially filled block is
> > returned only on EOF or when no more data is available. But in case of
> > STDIN copy, even though there is 8K of data available in the data block &
> > 8K of data available in STDIN, CopyGetData will return as soon as the
> > libpq buffer holds more than minread bytes. We pass a new data block
> > every time to load data, so each time we pass an 8K data block
> > CopyGetData may load only a few bytes into it & return. I wanted to keep
> > the same data population logic for both file copy & STDIN copy, i.e.
> > fill full 8K data blocks & then consume the populated data. There is an
> > alternative solution: I can have some special handling in the STDIN case
> > wherein the existing data block is passed along with the index from
> > where the data should be copied. Thoughts?
> >
>
> What you are proposing as an alternative solution, isn't that what we
> are doing without the patch? IIUC, you require this because of your
> corresponding changes to handle COPY_NEW_FE in CopyReadLine(), is that
> right? If so, what is the difficulty in making it behave similar to
> the non-parallel case?
>

The alternate solution is similar to how the existing copy handles STDIN
copies. I have made changes in the v7 patch attached at [1] to make
parallel copy handle STDIN data similar to non-parallel copy, so the
change that prompted the original comment has been removed from the 0001
patch:
> > + if (cstate->copy_dest == COPY_NEW_FE)
> > + minread = RAW_BUF_SIZE - nbytes;
> > +
> >   inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
> > -   1, RAW_BUF_SIZE - nbytes);
> > +   minread, RAW_BUF_SIZE - nbytes);
> >
> > No comment to explain why this change is done?

[1]
https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-10-14 Thread Bharath Rupireddy
I did performance testing on v7 patch set[1] with custom
postgresql.conf[2]. The results are of the triplet form (exec time in
sec, number of workers, gain)

Use case 1: 10million rows, 5.2GB data, 2 indexes on integer columns,
1 index on text column, binary file
(1104.898, 0, 1X), (1112.221, 1, 1X), (640.236, 2, 1.72X), (335.090,
4, 3.3X), (200.492, 8, 5.51X), (131.448, 16, 8.4X), (121.832, 20,
9.1X), (124.287, 30, 8.9X)

Use case 2: 10million rows, 5.2GB data,2 indexes on integer columns, 1
index on text column, copy from stdin, csv format
(1203.282, 0, 1X), (1135.517, 1, 1.06X), (655.140, 2, 1.84X),
(343.688, 4, 3.5X), (203.742, 8, 5.9X), (144.793, 16, 8.31X),
(133.339, 20, 9.02X), (136.672, 30, 8.8X)

Use case 3: 10million rows, 5.2GB data,2 indexes on integer columns, 1
index on text column, text file
(1165.991, 0, 1X), (1128.599, 1, 1.03X), (644.793, 2, 1.81X),
(342.813, 4, 3.4X), (204.279, 8, 5.71X), (139.986, 16, 8.33X),
(128.259, 20, 9.1X), (132.764, 30, 8.78X)

Above results are similar to the results with earlier versions of the patch set.

On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila  wrote:
>
> Sure, you need to change the code such that when force_parallel_mode =
> 'regress' is specified then it always uses one worker. This is
> primarily for testing purposes and will help during the development of
> this patch as it will make all existing Copy tests to use quite a good
> portion of the parallel infrastructure.
>

I performed force_parallel_mode = regress testing and found 2 issues,
the fixes for the same are available in v7 patch set[1].

>
> > Overall, we have below test cases to cover the code and for performance 
> > measurements. We plan to run these tests whenever a new set of patches is 
> > posted.
> >
> > 1. csv
> > 2. binary
>
> Don't we need the tests for plain text files as well?
>

I added a text use case and above mentioned are perf results on v7 patch set[1].

>
> > 3. force parallel mode = regress
> > 4. toast data csv and binary
> > 5. foreign key check, before row, after row, before statement, after 
> > statement, instead of triggers
> > 6. partition case
> > 7. foreign partitions and partitions having trigger cases
> > 8. where clause having parallel unsafe and safe expression, default 
> > parallel unsafe and safe expression
> > 9. temp, global, local, unlogged, inherited tables cases, foreign tables
> >
>
> Sounds like good coverage. So, are you doing all this testing
> manually? How are you maintaining these tests?
>

All test cases listed above, except for the cases that are meant to
measure perf gain with huge data, are present in v7-0005 patch in v7
patch set[1].

[1] 
https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com

[2]
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-09 Thread Amit Kapila
On Fri, Oct 9, 2020 at 3:50 PM Bharath Rupireddy
 wrote:
>
> On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila  wrote:
> >
> > On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy
> >  wrote:
> > >
> > > On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila  
> > > wrote:
> > > >
> > > > From the testing perspective,
> > > > 1. Test by having something force_parallel_mode = regress which means
> > > > that all existing Copy tests in the regression will be executed via
> > > > new worker code. You can have this as a test-only patch for now and
> > > > make sure all existing tests passed with this.
> > > >
> > >
> > > I don't think all the existing copy test cases(except the new test cases 
> > > added in the parallel copy patch set) would run inside the parallel 
> > > worker if force_parallel_mode is on. This is because, the parallelism 
> > > will be picked up for parallel copy only if parallel option is specified 
> > > unlike parallelism for select queries.
> > >
> >
> > Sure, you need to change the code such that when force_parallel_mode =
> > 'regress' is specified then it always uses one worker. This is
> > primarily for testing purposes and will help during the development of
> > this patch as it will make all existing Copy tests to use quite a good
> > portion of the parallel infrastructure.
> >
>
> IIUC, firstly, I will set force_parallel_mode = FORCE_PARALLEL_REGRESS
> as default value in guc.c,
>

No need to set this as the default value. You can change it in
postgresql.conf before running tests.

> and then adjust the parallelism related
> code in copy.c such that it always picks 1 worker and spawns it. This
> way, all the existing copy test cases would be run in parallel worker.
> Please let me know if this is okay.
>

Yeah, this sounds fine.

> If yes, I will do this and update
> here.
>

Okay, thanks, but ensure the difference in test execution before and
after your change. After your change, all the 'copy' tests should
invoke the worker to perform a copy.

> >
> > > All the above tests are performed on the latest v6 patch set (attached 
> > > here in this thread) with custom postgresql.conf[1]. The results are of 
> > > the triplet form (exec time in sec, number of workers, gain)
> > >
> >
> > Okay, so I am assuming the performance is the same as we have seen
> > with the earlier versions of patches.
> >
>
> Yes. Most recent run on v5 patch set [1]
>

Okay, good to know that.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-09 Thread Bharath Rupireddy
On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila  wrote:
>
> On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy
>  wrote:
> >
> > On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila  wrote:
> > >
> > > From the testing perspective,
> > > 1. Test by having something force_parallel_mode = regress which means
> > > that all existing Copy tests in the regression will be executed via
> > > new worker code. You can have this as a test-only patch for now and
> > > make sure all existing tests passed with this.
> > >
> >
> > I don't think all the existing copy test cases(except the new test cases 
> > added in the parallel copy patch set) would run inside the parallel worker 
> > if force_parallel_mode is on. This is because, the parallelism will be 
> > picked up for parallel copy only if parallel option is specified unlike 
> > parallelism for select queries.
> >
>
> Sure, you need to change the code such that when force_parallel_mode =
> 'regress' is specified then it always uses one worker. This is
> primarily for testing purposes and will help during the development of
> this patch as it will make all existing Copy tests to use quite a good
> portion of the parallel infrastructure.
>

IIUC, firstly, I will set force_parallel_mode = FORCE_PARALLEL_REGRESS
as default value in guc.c, and then adjust the parallelism related
code in copy.c such that it always picks 1 worker and spawns it. This
way, all the existing copy test cases would be run in parallel worker.
Please let me know if this is okay. If yes, I will do this and update
here.
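
In code terms, the tweak could be as small as something like this in the
worker-count selection logic (a sketch; the exact placement in copy.c is
illustrative):

/* Test-only tweak: force one worker so every existing COPY regression test
 * exercises the parallel copy infrastructure. force_parallel_mode and
 * FORCE_PARALLEL_REGRESS are the existing optimizer GUC symbols. */
if (force_parallel_mode == FORCE_PARALLEL_REGRESS)
    parallel_workers = 1;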

>
> > All the above tests are performed on the latest v6 patch set (attached here 
> > in this thread) with custom postgresql.conf[1]. The results are of the 
> > triplet form (exec time in sec, number of workers, gain)
> >
>
> Okay, so I am assuming the performance is the same as we have seen
> with the earlier versions of patches.
>

Yes. Most recent run on v5 patch set [1]

>
> > Overall, we have below test cases to cover the code and for performance 
> > measurements. We plan to run these tests whenever a new set of patches is 
> > posted.
> >
> > 1. csv
> > 2. binary
>
> Don't we need the tests for plain text files as well?
>

Will add one.

>
> > 3. force parallel mode = regress
> > 4. toast data csv and binary
> > 5. foreign key check, before row, after row, before statement, after 
> > statement, instead of triggers
> > 6. partition case
> > 7. foreign partitions and partitions having trigger cases
> > 8. where clause having parallel unsafe and safe expression, default 
> > parallel unsafe and safe expression
> > 9. temp, global, local, unlogged, inherited tables cases, foreign tables
> >
>
> Sounds like good coverage. So, are you doing all this testing
> manually? How are you maintaining these tests?
>

Yes, running them manually. A few of the tests (1, 2, 4) require huge
datasets for performance measurements, and the other test cases are to
ensure we don't choose parallelism when we shouldn't. We will try to add
the test cases that are not meant for performance to the patch's tests.

[1] - 
https://www.postgresql.org/message-id/CALj2ACW%3Djm5ri%2B7rXiQaFT_c5h2rVS%3DcJOQVFR5R%2Bbowt3QDkw%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-09 Thread Amit Kapila
On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy
 wrote:
>
> On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila  wrote:
> >
> > From the testing perspective,
> > 1. Test by having something force_parallel_mode = regress which means
> > that all existing Copy tests in the regression will be executed via
> > new worker code. You can have this as a test-only patch for now and
> > make sure all existing tests passed with this.
> >
>
> I don't think all the existing copy test cases(except the new test cases 
> added in the parallel copy patch set) would run inside the parallel worker if 
> force_parallel_mode is on. This is because, the parallelism will be picked up 
> for parallel copy only if parallel option is specified unlike parallelism for 
> select queries.
>

Sure, you need to change the code such that when force_parallel_mode =
'regress' is specified then it always uses one worker. This is
primarily for testing purposes and will help during the development of
this patch as it will make all existing Copy tests to use quite a good
portion of the parallel infrastructure.

>
> All the above tests are performed on the latest v6 patch set (attached here 
> in this thread) with custom postgresql.conf[1]. The results are of the 
> triplet form (exec time in sec, number of workers, gain)
>

Okay, so I am assuming the performance is the same as we have seen
with the earlier versions of patches.

> Overall, we have below test cases to cover the code and for performance 
> measurements. We plan to run these tests whenever a new set of patches is 
> posted.
>
> 1. csv
> 2. binary

Don't we need the tests for plain text files as well?

> 3. force parallel mode = regress
> 4. toast data csv and binary
> 5. foreign key check, before row, after row, before statement, after 
> statement, instead of triggers
> 6. partition case
> 7. foreign partitions and partitions having trigger cases
> 8. where clause having parallel unsafe and safe expression, default parallel 
> unsafe and safe expression
> 9. temp, global, local, unlogged, inherited tables cases, foreign tables
>

Sounds like good coverage. So, are you doing all this testing
manually? How are you maintaining these tests?

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-09 Thread Greg Nancarrow
On Fri, Oct 9, 2020 at 5:40 PM Amit Kapila  wrote:
>
> > Looking a bit deeper into this, I'm wondering if in fact your
> > EstimateStringSize() and EstimateNodeSize() functions should be using
> > BUFFERALIGN() for EACH stored string/node (rather than just calling
> > shm_toc_estimate_chunk() once at the end, after the length of packed
> > strings and nodes has been estimated), to ensure alignment of start of
> > each string/node. Other Postgres code appears to be aligning each
> > stored chunk using shm_toc_estimate_chunk(). See the definition of
> > that macro and its current usages.
> >
>
> I am not sure if this required for the purpose of correctness. AFAIU,
> we do store/estimate multiple parameters in same way at other places,
> see EstimateParamListSpace and SerializeParamList. Do you have
> something else in mind?
>

The point I was trying to make is that potentially more efficient code
can be used if the individual strings/nodes are aligned, rather than
packed (as they are now), but as you point out, there are already
cases (e.g. SerializeParamList) where within the separately-aligned
chunks the data is not aligned, so maybe not a big deal. Oh well,
without alignment, that means use of memcpy() cannot really be avoided
here for serializing/de-serializing ints etc., let's hope the compiler
optimizes it as best it can.
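
In other words, the portable pattern for reading from the packed layout
stays memcpy()-based on both sides, e.g. (illustrative):

uint32 len;

/* Unaligned-safe: legal on every architecture; compilers typically fold
 * this memcpy() into a single load where unaligned access is cheap. */
memcpy(&len, srcPtr, sizeof(uint32));

/* The cast form is only safe when srcPtr is known to be suitably aligned:
 *     len = *(uint32 *) srcPtr;
 */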

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Parallel copy

2020-10-09 Thread Amit Kapila
On Thu, Oct 8, 2020 at 8:43 AM Greg Nancarrow  wrote:
>
> On Thu, Oct 8, 2020 at 5:44 AM vignesh C  wrote:
>
> > Attached v6 patch with the fixes.
> >
>
> Hi Vignesh,
>
> I noticed a couple of issues when scanning the code in the following patch:
>
> v6-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> In the following code, it will put a junk uint16 value into *destptr
> (and thus may well cause a crash) on a Big Endian architecture
> (Solaris Sparc, s390x, etc.):
> You're storing a (uint16) string length in a uint32 and then pulling
> out the lower two bytes of the uint32 and copying them into the
> location pointed to by destptr.
>
>
> static void
> +CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
> + uint32 *copiedsize)
> +{
> + uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
> +
> + memcpy(destptr, (uint16 *) &len, sizeof(uint16));
> + *copiedsize += sizeof(uint16);
> + if (len)
> + {
> + memcpy(destptr + sizeof(uint16), srcPtr, len);
> + *copiedsize += len;
> + }
> +}
>
> I suggest you change the code to:
>
> uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
> memcpy(destptr, &len, sizeof(uint16));
>
> [I assume string length here can't ever exceed (65535 - 1), right?]
>

Your suggestion makes sense to me if the assumption related to string
length is correct. If we can't ensure that then we need to probably
use four bytes uint32 to store the length.

> Looking a bit deeper into this, I'm wondering if in fact your
> EstimateStringSize() and EstimateNodeSize() functions should be using
> BUFFERALIGN() for EACH stored string/node (rather than just calling
> shm_toc_estimate_chunk() once at the end, after the length of packed
> strings and nodes has been estimated), to ensure alignment of start of
> each string/node. Other Postgres code appears to be aligning each
> stored chunk using shm_toc_estimate_chunk(). See the definition of
> that macro and its current usages.
>

I am not sure if this required for the purpose of correctness. AFAIU,
we do store/estimate multiple parameters in same way at other places,
see EstimateParamListSpace and SerializeParamList. Do you have
something else in mind?

While looking at the latest code, I observed below issue in patch
v6-0003-Allow-copy-from-command-to-process-data-from-file:

+ /* Estimate the size for shared information for PARALLEL_COPY_KEY_CSTATE */
+ est_cstateshared = MAXALIGN(sizeof(SerializedParallelCopyState));
+ shm_toc_estimate_chunk(&pcxt->estimator, est_cstateshared);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ strsize = EstimateCstateSize(pcxt, cstate, attnamelist, &whereClauseStr,
+ &rangeTableStr, &attnameListStr,
+ &notnullListStr, &nullListStr,
+ &convertListStr);

Here, do we need to separately estimate the size of
SerializedParallelCopyState when it is also done in
EstimateCstateSize?

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-08 Thread Amit Kapila
On Thu, Oct 8, 2020 at 12:14 AM vignesh C  wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila  wrote:
> > > > + */
> > > > +typedef struct ParallelCopyLineBoundary
> > > >
> > > > Are we doing all this state management to avoid using locks while
> > > > processing lines?  If so, I think we can use either spinlock or LWLock
> > > > to keep the main patch simple and then provide a later patch to make
> > > > it lock-less.  This will allow us to first focus on the main design of
> > > > the patch rather than trying to make this datastructure processing
> > > > lock-less in the best possible way.
> > > >
> > >
> > > The steps will be more or less same if we use spinlock too. step 1, step 
> > > 3 & step 4 will be common we have to use lock & unlock instead of step 2 
> > > & step 5. I feel we can retain the current implementation.
> > >
> >
> > I'll study this in detail and let you know my opinion on the same but
> > in the meantime, I don't follow one part of this comment: "If they
> > don't follow this order the worker might process wrong line_size and
> > leader might populate the information which worker has not yet
> > processed or in the process of processing."
> >
> > Do you want to say that leader might overwrite some information which
> > worker hasn't read yet? If so, it is not clear from the comment.
> > Another minor point about this comment:
> >
>
> Here leader and worker must follow these steps to avoid any corruption
> or hang issue. Changed it to:
>  * The leader & worker process access the shared line information by following
>  * the below steps to avoid any data corruption or hang:
>

Actually, I wanted more on the lines why such corruption or hang can
happen? It might help reviewers to understand why you have followed
such a sequence.
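
For readers following the thread: the hazard is the classic message-passing
ordering problem. A minimal sketch of the discipline involved (illustrative
only; the 'data' field stands in for the patch's block/offset bookkeeping,
and the patch's actual scheme lives in ParallelCopyLineBoundary):

/* leader: fill the payload first, then publish its size */
memcpy(lineInfo->data, line, size);
pg_write_barrier();
lineInfo->line_size = size;

/* worker: read the size first, then (and only then) the payload */
size = lineInfo->line_size;
pg_read_barrier();
/* ... now safe to read size bytes from lineInfo->data ... */

If either side skips its barrier (or the equivalent atomic-op ordering), the
worker can observe the new line_size before the payload bytes have landed,
which is exactly the "wrong line_size" corruption described above.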

> >
> > How did you ensure that this is fixed? Have you tested it, if so
> > please share the test? I see a basic problem with your fix.
> >
> > + /* Report WAL/buffer usage during parallel execution */
> > + bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
> > + walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
> > + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> > +   &walusage[ParallelWorkerNumber]);
> >
> > You need to call InstrStartParallelQuery() before the actual operation
> > starts, without that stats won't be accurate? Also, after calling
> > WaitForParallelWorkersToFinish(), you need to accumulate the stats
> > collected from workers which neither you have done nor is possible
> > with the current code in your patch because you haven't made any
> > provision to capture them in BeginParallelCopy.
> >
> > I suggest you look into lazy_parallel_vacuum_indexes() and
> > begin_parallel_vacuum() to understand how the buffer/wal usage stats
> > are accumulated. Also, please test this functionality using
> > pg_stat_statements.
> >
>
> Made changes accordingly.
> I have verified it using:
> postgres=# select * from pg_stat_statements where query like '%copy%';
> [wide output condensed to the relevant non-zero columns]
>
> query = copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv'
> with(format csv, delimiter ',')
> calls = 1, total_exec_time = 265.195105, rows = 175000
> shared_blks_hit = 1916, shared_blks_dirtied = 946, shared_blks_written = 946
> wal_records = 1116, wal_fpi = 0, wal_bytes = 3587203
>
> query = copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv'
> with(format csv, delimiter ',', parallel '2')
> calls = 1, total_exec_time = 35668.402482, rows = 175000
> shared_blks_hit = 3101, shared_blks_read = 36, shared_blks_dirtied = 952,
> shared_blks_written = 919
> wal_records = 1119, wal_fpi = 6, wal_bytes = 3624405
> (2 rows)

Re: Parallel copy

2020-10-08 Thread Amit Kapila
On Thu, Oct 8, 2020 at 12:14 AM vignesh C  wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila  wrote:
> >
> >
> > I am convinced by the reason given by Kyotaro-San in that another
> > thread [1] and performance data shown by Peter that this can't be an
> > independent improvement and rather in some cases it can do harm. Now,
> > if you need it for a parallel-copy path then we can change it
> > specifically to the parallel-copy code path but I don't understand
> > your reason completely.
> >
>
> Whenever we need data to be populated, we will get a new data block &
> pass it to CopyGetData to populate the data. In case of file copy, the
> server will completely fill the data block; a partially filled block is
> returned only on EOF or when no more data is available. But in case of
> STDIN copy, even though there is 8K of data available in the data block &
> 8K of data available in STDIN, CopyGetData will return as soon as the
> libpq buffer holds more than minread bytes. We pass a new data block
> every time to load data, so each time we pass an 8K data block
> CopyGetData may load only a few bytes into it & return. I wanted to keep
> the same data population logic for both file copy & STDIN copy, i.e.
> fill full 8K data blocks & then consume the populated data. There is an
> alternative solution: I can have some special handling in the STDIN case
> wherein the existing data block is passed along with the index from
> where the data should be copied. Thoughts?
>

What you are proposing as an alternative solution, isn't that what we
are doing without the patch? IIUC, you require this because of your
corresponding changes to handle COPY_NEW_FE in CopyReadLine(), is that
right? If so, what is the difficulty in making it behave similar to
the non-parallel case?

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-07 Thread vignesh C
On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila  wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila  wrote:
> >
> > Few additional comments:
> > ==
>
> Some more comments:
>
> v5-0002-Framework-for-leader-worker-in-parallel-copy
> ===
> 1.
> These values
> + * help in handover of multiple records with significant size of data to be
> + * processed by each of the workers to make sure there is no context
> switch & the
> + * work is fairly distributed among the workers.
>
> How about writing it as: "These values help in the handover of
> multiple records with the significant size of data to be processed by
> each of the workers. This also ensures there is no context switch and
> the work is fairly distributed among the workers."

Changed as suggested.

>
> 2. Can we keep WORKER_CHUNK_COUNT, MAX_BLOCKS_COUNT, and RINGSIZE as
> power-of-two? Say WORKER_CHUNK_COUNT as 64, MAX_BLOCK_COUNT as 1024,
> and accordingly choose RINGSIZE. At many places, we do that way. I
> think it can sometimes help in faster processing due to cache size
> requirements and in this case, I don't see a reason why we can't
> choose these values to be power-of-two. If you agree with this change
> then also do some performance testing after this change?
>

Modified as suggested. I have run a few performance tests & verified
there is no degradation. We will post a full performance run
separately in the coming days.
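
Concretely, the constants could end up like this (64 and 1024 per the
suggestion above; the RINGSIZE multiplier below is only an illustrative way
to keep that value a power-of-two as well):

#define WORKER_CHUNK_COUNT 64                /* power-of-two */
#define MAX_BLOCKS_COUNT   1024              /* power-of-two */
#define RINGSIZE (MAX_BLOCKS_COUNT * 16)     /* 16384, scales with the block count */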

> 3.
> + bool   curr_blk_completed;
> + char   data[DATA_BLOCK_SIZE]; /* data read from file */
> + uint8  skip_bytes;
> +} ParallelCopyDataBlock;
>
> Is there a reason to keep skip_bytes after data? Normally the variable
> size data is at the end of the structure. Also, there is no comment
> explaining the purpose of skip_bytes.
>

Modified as suggested and added comments.

> 4.
> + * Copy data block information.
> + * ParallelCopyDataBlock's will be created in DSM. Data read from file will 
> be
> + * copied in these DSM data blocks. The leader process identifies the records
> + * and the record information will be shared to the workers. The workers will
> + * insert the records into the table. There can be one or more number
> of records
> + * in each of the data block based on the record size.
> + */
> +typedef struct ParallelCopyDataBlock
>
> Keep one empty line after the description line like below. I also
> suggested to do a minor tweak in the above sentence which is as
> follows:
>
> * Copy data block information.
> *
> * These data blocks are created in DSM. Data read ...
>
> Try to follow a similar format in other comments as well.
>

Modified as suggested.

> 5. I think it is better to move parallelism related code to a new file
> (we can name it as copyParallel.c or something like that).
>

Modified. Added a copyparallel.c file to hold the copy parallelism
functionality; some of the function prototypes & data structures were
moved to the copy.h header file so that they can be shared between
copy.c & copyparallel.c.

> 6. copy.c(1648,25): warning C4133: 'function': incompatible types -
> from 'ParallelCopyLineState *' to 'uint32 *'
> Getting above compilation warning on Windows.
>

Modified the data type.

> v5-0003-Allow-copy-from-command-to-process-data-from-file
> ==
> 1.
> @@ -4294,7 +5047,7 @@ BeginCopyFrom(ParseState *pstate,
>   * only in text mode.
>   */
> + initStringInfo(&cstate->attribute_buf);
> - cstate->raw_buf = (char *) palloc(RAW_BUF_SIZE + 1);
> + cstate->raw_buf = (IsParallelCopy()) ? NULL : (char *)
> palloc(RAW_BUF_SIZE + 1);
>
> Is there any way IsParallelCopy can be true by this time? AFAICS, we don't do
> anything about parallelism until after this. If you want to save this
> allocation then we need to move this after we determine that
> parallelism can be used or not and accordingly the below code in the
> patch needs to be changed.
>
>  * ParallelCopyFrom - parallel copy leader's functionality.
>   *
>   * Leader executes the before statement for before statement trigger, if 
> before
> @@ -1110,8 +1547,302 @@ ParallelCopyFrom(CopyState cstate)
>   ParallelCopyShmInfo *pcshared_info = cstate->pcdata->pcshared_info;
>   ereport(DEBUG1, (errmsg("Running parallel copy leader")));
>
> + /* raw_buf is not used in parallel copy, instead data blocks are used.*/
> + pfree(cstate->raw_buf);
> + cstate->raw_buf = NULL;
>

Removed the palloc change, raw_buf will be allocated both for parallel
and non parallel copy. One other solution that I thought was to move
the memory allocation to CopyFrom, but this solution might affect fdw
w

Re: Parallel copy

2020-10-07 Thread vignesh C
On Mon, Sep 28, 2020 at 6:37 PM Ashutosh Sharma  wrote:
>
> On Mon, Sep 28, 2020 at 3:01 PM Amit Kapila  wrote:
> >
> > On Tue, Sep 22, 2020 at 2:44 PM vignesh C  wrote:
> > >
> > > Thanks Ashutosh for your comments.
> > >
> > > On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma  
> > > wrote:
> > > >
> > > > Hi Vignesh,
> > > >
> > > > I've spent some time today looking at your new set of patches and I've
> > > > some thoughts and queries which I would like to put here:
> > > >
> > > > Why are these not part of the shared cstate structure?
> > > >
> > > > SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, 
> > > > cstate->null_print);
> > > > SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
> > > > SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
> > > > SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);
> > > >
> > >
> > > I have used shared_cstate mainly to share the integer & bool data
> > > types from the leader to worker process. The above data types are of
> > > char* data type, I will not be able to use it like how I could do it
> > > for integer type. So I preferred to send these as separate keys to the
> > > worker. Thoughts?
> > >
> >
> > I think the way you have written will work but if we go with
> > Ashutosh's proposal it will look elegant and in the future, if we need
> > to share more strings as part of cstate structure then that would be
> > easier. You can probably refer to EstimateParamListSpace,
> > SerializeParamList, and RestoreParamList to see how we can share
> > different types of data in one key.
> >
>
> Yeah. And in addition to that it will also reduce the number of DSM
> keys that we need to maintain.
>

Thanks Ashutosh, This is handled as part of the v6 patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-07 Thread Greg Nancarrow
On Thu, Oct 8, 2020 at 5:44 AM vignesh C  wrote:

> Attached v6 patch with the fixes.
>

Hi Vignesh,

I noticed a couple of issues when scanning the code in the following patch:

v6-0003-Allow-copy-from-command-to-process-data-from-file.patch

In the following code, it will put a junk uint16 value into *destptr
(and thus may well cause a crash) on a Big Endian architecture
(Solaris Sparc, s390x, etc.):
You're storing a (uint16) string length in a uint32 and then pulling
out the lower two bytes of the uint32 and copying them into the
location pointed to by destptr.


static void
+CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
+ uint32 *copiedsize)
+{
+ uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
+
+ memcpy(destptr, (uint16 *) &len, sizeof(uint16));
+ *copiedsize += sizeof(uint16);
+ if (len)
+ {
+ memcpy(destptr + sizeof(uint16), srcPtr, len);
+ *copiedsize += len;
+ }
+}
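
To see the failure mode concretely (a standalone illustration, not patch
code):

uint32 len = 42;    /* bytes: 2A 00 00 00 on little-endian, 00 00 00 2A on big-endian */
uint16 stored;

memcpy(&stored, &len, sizeof(uint16));
/* little-endian: copies 2A 00, so stored == 42 (happens to work) */
/* big-endian:    copies 00 00, so stored == 0  (the length is lost) */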

I suggest you change the code to:

uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
memcpy(destptr, &len, sizeof(uint16));

[I assume string length here can't ever exceed (65535 - 1), right?]

Looking a bit deeper into this, I'm wondering if in fact your
EstimateStringSize() and EstimateNodeSize() functions should be using
BUFFERALIGN() for EACH stored string/node (rather than just calling
shm_toc_estimate_chunk() once at the end, after the length of packed
strings and nodes has been estimated), to ensure alignment of start of
each string/node. Other Postgres code appears to be aligning each
stored chunk using shm_toc_estimate_chunk(). See the definition of
that macro and its current usages.

Then you could safely use:

uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
*(uint16 *)destptr = len;
*copiedsize += sizeof(uint16);
if (len)
{
memcpy(destptr + sizeof(uint16), srcPtr, len);
*copiedsize += len;
}

and in the CopyStringFromSharedMemory() function, then could safely use:

len = *(uint16 *)srcPtr;

The compiler may be smart enough to optimize-away the memcpy() in this
case anyway, but there are issues in doing this for architectures that
take a performance hit for unaligned access, or don't support
unaligned access.

Also, in CopyFromSharedMemory() functions, you should use palloc()
instead of palloc0(), as you're filling the entire palloc'd buffer
anyway, so no need to ask for additional MemSet() of all buffer bytes
to 0 prior to memcpy().


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Parallel copy

2020-10-07 Thread vignesh C
On Mon, Sep 28, 2020 at 3:01 PM Amit Kapila  wrote:
>
> On Tue, Sep 22, 2020 at 2:44 PM vignesh C  wrote:
> >
> > Thanks Ashutosh for your comments.
> >
> > On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma  
> > wrote:
> > >
> > > Hi Vignesh,
> > >
> > > I've spent some time today looking at your new set of patches and I've
> > > some thoughts and queries which I would like to put here:
> > >
> > > Why are these not part of the shared cstate structure?
> > >
> > > SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, 
> > > cstate->null_print);
> > > SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
> > > SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
> > > SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);
> > >
> >
> > I have used shared_cstate mainly to share the integer & bool data
> > types from the leader to worker process. The above data types are of
> > char* data type, I will not be able to use it like how I could do it
> > for integer type. So I preferred to send these as separate keys to the
> > worker. Thoughts?
> >
>
> I think the way you have written will work but if we go with
> Ashutosh's proposal it will look elegant and in the future, if we need
> to share more strings as part of cstate structure then that would be
> easier. You can probably refer to EstimateParamListSpace,
> SerializeParamList, and RestoreParamList to see how we can share
> different types of data in one key.
>

Thanks for the solution Amit, I have fixed this and handled it in the
v6 patch shared in my previous mail.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-07 Thread vignesh C
On Tue, Sep 29, 2020 at 3:16 PM Greg Nancarrow  wrote:
>
> Hi Vignesh and Bharath,
>
> Seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as
> parallel-unsafe.
> Can you explain why this is?

Yes we don't need to restrict parallelism for RI_TRIGGER_PK cases as
we don't do any command counter increments while performing PK checks
as opposed to RI_TRIGGER_FK/foreign key checks. We have modified this
in the v6 patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-03 Thread Amit Kapila
On Sat, Oct 3, 2020 at 6:20 AM Tomas Vondra
 wrote:
>
> Hello Vignesh,
>
> I've done some basic benchmarking on the v4 version of the patches (but
> AFAIKC the v5 should perform about the same), and some initial review.
>
> For the benchmarking, I used the lineitem table from TPC-H - for 75GB
> data set, this largest table is about 64GB once loaded, with another
> 54GB in 5 indexes. This is on a server with 32 cores, 64GB of RAM and
> NVME storage.
>
> The COPY duration with varying number of workers (specified using the
> parallel COPY option) looks like this:
>
>    workers    duration
>    -------    --------
>          0        1366
>          1        1255
>          2         704
>          3         526
>          4         434
>          5         385
>          6         347
>          7         322
>          8         327
>
> So this seems to work pretty well - initially we get almost linear
> speedup, then it slows down (likely due to contention for locks, I/O
> etc.). Not bad.
>

+1. These numbers (> 4x speed up) look good to me.


-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-02 Thread Tomas Vondra

Hello Vignesh,

I've done some basic benchmarking on the v4 version of the patches (but
AFAIKC the v5 should perform about the same), and some initial review.

For the benchmarking, I used the lineitem table from TPC-H - for 75GB
data set, this largest table is about 64GB once loaded, with another
54GB in 5 indexes. This is on a server with 32 cores, 64GB of RAM and
NVME storage.

The COPY duration with varying number of workers (specified using the
parallel COPY option) looks like this:

   workers    duration
   -------    --------
         0        1366
         1        1255
         2         704
         3         526
         4         434
         5         385
         6         347
         7         322
         8         327

So this seems to work pretty well - initially we get almost linear
speedup, then it slows down (likely due to contention for locks, I/O
etc.). Not bad.

I've only done a quick review, but overall the patch looks in fairly
good shape.

1) I don't quite understand why we need INCREMENTPROCESSED and
RETURNPROCESSED, considering they just do ++ or return. It just
obfuscates the code, I think.

2) I find it somewhat strange that BeginParallelCopy can just decide not
to do parallel copy after all. Why not make this decision in the
caller? Or maybe it's fine this way, not sure.

3) AFAIK we don't modify typedefs.list in patches, so these changes
should be removed. 


4) IsTriggerFunctionParallelSafe actually checks all triggers, not just
one, so the comment needs minor rewording.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Parallel copy

2020-10-01 Thread Amit Kapila
On Tue, Sep 29, 2020 at 3:16 PM Greg Nancarrow  wrote:
>
> Hi Vignesh and Bharath,
>
> Seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as
> parallel-unsafe.
> Can you explain why this is?
>

I don't think we need to restrict this case and even if there is some
reason to do so then probably the same should be mentioned in the
comments.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-09-29 Thread vignesh C
On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila  wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila  wrote:
> >
> > Few additional comments:
> > ==
>
> Some more comments:
>

Thanks Amit for the comments, I will work on the comments and provide
a patch in the next few days.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-09-29 Thread Amit Kapila
On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila  wrote:
>
> Few additional comments:
> ==

Some more comments:

v5-0002-Framework-for-leader-worker-in-parallel-copy
===
1.
These values
+ * help in handover of multiple records with significant size of data to be
+ * processed by each of the workers to make sure there is no context
switch & the
+ * work is fairly distributed among the workers.

How about writing it as: "These values help in the handover of
multiple records with the significant size of data to be processed by
each of the workers. This also ensures there is no context switch and
the work is fairly distributed among the workers."

2. Can we keep WORKER_CHUNK_COUNT, MAX_BLOCKS_COUNT, and RINGSIZE as
power-of-two? Say WORKER_CHUNK_COUNT as 64, MAX_BLOCK_COUNT as 1024,
and accordingly choose RINGSIZE. At many places, we do that way. I
think it can sometimes help in faster processing due to cache size
requirements and in this case, I don't see a reason why we can't
choose these values to be power-of-two. If you agree with this change
then also do some performance testing after this change?
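
For instance, something along these lines (a sketch; the exact RINGSIZE
value is open, it just needs to remain a multiple of WORKER_CHUNK_COUNT):

#define WORKER_CHUNK_COUNT  64          /* power of two */
#define MAX_BLOCKS_COUNT    1024        /* power of two */
#define RINGSIZE            (16 * 1024) /* power of two, multiple of
                                         * WORKER_CHUNK_COUNT */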

3.
+ bool   curr_blk_completed;
+ char   data[DATA_BLOCK_SIZE]; /* data read from file */
+ uint8  skip_bytes;
+} ParallelCopyDataBlock;

Is there a reason to keep skip_bytes after data? Normally the variable
size data is at the end of the structure. Also, there is no comment
explaining the purpose of skip_bytes.
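
For example, something along these lines (a sketch; I am assuming from its
name that skip_bytes records carry-over bytes of a record spilling into the
next block):

typedef struct ParallelCopyDataBlock
{
	bool	curr_blk_completed;
	uint8	skip_bytes;				/* assumed: bytes of a record spilling
									 * over into the next block */
	char	data[DATA_BLOCK_SIZE];	/* data read from file; kept last */
} ParallelCopyDataBlock;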

4.
+ * Copy data block information.
+ * ParallelCopyDataBlock's will be created in DSM. Data read from file will be
+ * copied in these DSM data blocks. The leader process identifies the records
+ * and the record information will be shared to the workers. The workers will
+ * insert the records into the table. There can be one or more number
of records
+ * in each of the data block based on the record size.
+ */
+typedef struct ParallelCopyDataBlock

Keep one empty line after the description line like below. I also
suggested to do a minor tweak in the above sentence which is as
follows:

* Copy data block information.
*
* These data blocks are created in DSM. Data read ...

Try to follow a similar format in other comments as well.

5. I think it is better to move parallelism related code to a new file
(we can name it as copyParallel.c or something like that).

6. copy.c(1648,25): warning C4133: 'function': incompatible types -
from 'ParallelCopyLineState *' to 'uint32 *'
Getting above compilation warning on Windows.

v5-0003-Allow-copy-from-command-to-process-data-from-file
==
1.
@@ -4294,7 +5047,7 @@ BeginCopyFrom(ParseState *pstate,
  * only in text mode.
  */
  initStringInfo(&cstate->attribute_buf);
- cstate->raw_buf = (char *) palloc(RAW_BUF_SIZE + 1);
+ cstate->raw_buf = (IsParallelCopy()) ? NULL : (char *)
palloc(RAW_BUF_SIZE + 1);

Is there anyway IsParallelCopy can be true by this time? AFAICS, we do
anything about parallelism after this. If you want to save this
allocation then we need to move this after we determine that
parallelism can be used or not and accordingly the below code in the
patch needs to be changed.

 * ParallelCopyFrom - parallel copy leader's functionality.
  *
  * Leader executes the before statement for before statement trigger, if before
@@ -1110,8 +1547,302 @@ ParallelCopyFrom(CopyState cstate)
  ParallelCopyShmInfo *pcshared_info = cstate->pcdata->pcshared_info;
  ereport(DEBUG1, (errmsg("Running parallel copy leader")));

+ /* raw_buf is not used in parallel copy, instead data blocks are used.*/
+ pfree(cstate->raw_buf);
+ cstate->raw_buf = NULL;

Is there anything else also the allocation of which depends on parallelism?
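
In other words, something like this after the decision point (a sketch):

	/* Allocate raw_buf only once we know parallelism won't be used. */
	if (!IsParallelCopy())
		cstate->raw_buf = (char *) palloc(RAW_BUF_SIZE + 1);
	else
		cstate->raw_buf = NULL;	/* parallel copy reads via DSM data blocks */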

2.
+static pg_attribute_always_inline bool
+IsParallelCopyAllowed(CopyState cstate)
+{
+ /* Parallel copy not allowed for frontend (2.0 protocol) & binary option. */
+ if ((cstate->copy_dest == COPY_OLD_FE) || cstate->binary)
+ return false;
+
+ /* Check if copy is into foreign table or temporary table. */
+ if (cstate->rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE ||
+ RelationUsesLocalBuffers(cstate->rel))
+ return false;
+
+ /* Check if trigger function is parallel safe. */
+ if (cstate->rel->trigdesc != NULL &&
+ !IsTriggerFunctionParallelSafe(cstate->rel->trigdesc))
+ return false;
+
+ /*
+ * Check if there is after statement or instead of trigger or transition
+ * table triggers.
+ */
+ if (cstate->rel->trigdesc != NULL &&
+ (cstate->rel->trigdesc->trig_insert_after_statement ||
+ cstate->rel->trigdesc->trig_insert_instead_row ||
+ cstate->rel->trigdesc->trig_insert_new_table))
+ return false;
+
+ /* Check if the volatile expressions are parallel safe, if present any. */
+ if (!CheckExprParallelSafety(cstate))
+ return false;
+
+ /* Check if the insertion mode is single. */
+ if (FindInsertMethod(cstate) 

Re: Parallel copy

2020-09-29 Thread Greg Nancarrow
Hi Vignesh and Bharath,

Seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as
parallel-unsafe.
Can you explain why this is?

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Parallel copy

2020-09-28 Thread Ashutosh Sharma
On Mon, Sep 28, 2020 at 3:01 PM Amit Kapila  wrote:
>
> On Tue, Sep 22, 2020 at 2:44 PM vignesh C  wrote:
> >
> > Thanks Ashutosh for your comments.
> >
> > On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma  
> > wrote:
> > >
> > > Hi Vignesh,
> > >
> > > I've spent some time today looking at your new set of patches and I've
> > > some thoughts and queries which I would like to put here:
> > >
> > > Why are these not part of the shared cstate structure?
> > >
> > > SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, 
> > > cstate->null_print);
> > > SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
> > > SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
> > > SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);
> > >
> >
> > I have used shared_cstate mainly to share the integer & bool data
> > types from the leader to worker process. The above data types are of
> > char* data type, I will not be able to use it like how I could do it
> > for integer type. So I preferred to send these as separate keys to the
> > worker. Thoughts?
> >
>
> I think the way you have written will work but if we go with
> Ashutosh's proposal it will look elegant and in the future, if we need
> to share more strings as part of cstate structure then that would be
> easier. You can probably refer to EstimateParamListSpace,
> SerializeParamList, and RestoreParamList to see how we can share
> different types of data in one key.
>

Yeah. And in addition to that it will also reduce the number of DSM
keys that we need to maintain.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: Parallel copy

2020-09-28 Thread Amit Kapila
On Tue, Sep 22, 2020 at 2:44 PM vignesh C  wrote:
>
> Thanks Ashutosh for your comments.
>
> On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma  wrote:
> >
> > Hi Vignesh,
> >
> > I've spent some time today looking at your new set of patches and I've
> > some thoughts and queries which I would like to put here:
> >
> > Why are these not part of the shared cstate structure?
> >
> > SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print);
> > SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
> > SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
> > SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);
> >
>
> I have used shared_cstate mainly to share the integer & bool data
> types from the leader to worker process. The above data types are of
> char* data type, I will not be able to use it like how I could do it
> for integer type. So I preferred to send these as separate keys to the
> worker. Thoughts?
>

I think the way you have written will work but if we go with
Ashutosh's proposal it will look elegant and in the future, if we need
to share more strings as part of cstate structure then that would be
easier. You can probably refer to EstimateParamListSpace,
SerializeParamList, and RestoreParamList to see how we can share
different types of data in one key.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-09-28 Thread Amit Kapila
On Wed, Jul 22, 2020 at 7:48 PM vignesh C  wrote:
>
> On Tue, Jul 21, 2020 at 3:54 PM Amit Kapila  wrote:
> >
>
> > Review comments:
> > ===
> >
> > 0001-Copy-code-readjustment-to-support-parallel-copy
> > 1.
> > @@ -807,8 +835,11 @@ CopyLoadRawBuf(CopyState cstate)
> >   else
> >   nbytes = 0; /* no data need be saved */
> >
> > + if (cstate->copy_dest == COPY_NEW_FE)
> > + minread = RAW_BUF_SIZE - nbytes;
> > +
> >   inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
> > -   1, RAW_BUF_SIZE - nbytes);
> > +   minread, RAW_BUF_SIZE - nbytes);
> >
> > No comment to explain why this change is done?
> >
> > 0002-Framework-for-leader-worker-in-parallel-copy
>
> Currently CopyGetData copies a lesser amount of data to the buffer even though 
> space is available in the buffer, because minread was passed as 1 to CopyGetData. 
> Because of this there are frequent calls to CopyGetData for fetching the data. 
> In this case it will load only some data due to the below check:
> while (maxread > 0 && bytesread < minread && !cstate->reached_eof)
> After reading some data, bytesread will be greater than minread (which was 
> passed as 1) and it returns with a lesser amount of data, even though there is 
> still some space.
> This change is required for parallel copy feature as each time we get a new 
> DSM data block which is of 64K size and copy the data. If we copy less data 
> into DSM data blocks we might end up consuming all the DSM data blocks.
>

Why can't we reuse the DSM block which has unfilled space?

>  I felt this issue can be fixed as part of HEAD. Have posted a separate 
> thread [1] for this. I'm planning to remove that change once it gets 
> committed. Can that go as a separate
> patch or should we include it here?
> [1] - 
> https://www.postgresql.org/message-id/CALDaNm0v4CjmvSnftYnx_9pOS_dKRG%3DO3NnBgJsQmi0KipvLog%40mail.gmail.com
>

I am convinced by the reason given by Kyotaro-San in that another
thread [1] and performance data shown by Peter that this can't be an
independent improvement and rather in some cases it can do harm. Now,
if you need it for a parallel-copy path then we can change it
specifically to the parallel-copy code path but I don't understand
your reason completely.

> > 2.
..
> > + */
> > +typedef struct ParallelCopyLineBoundary
> >
> > Are we doing all this state management to avoid using locks while
> > processing lines?  If so, I think we can use either spinlock or LWLock
> > to keep the main patch simple and then provide a later patch to make
> > it lock-less.  This will allow us to first focus on the main design of
> > the patch rather than trying to make this datastructure processing
> > lock-less in the best possible way.
> >
>
> The steps will be more or less the same if we use a spinlock too. Step 1, step 3 & 
> step 4 will be common; we would have to use lock & unlock instead of step 2 & 
> step 5. I feel we can retain the current implementation.
>

I'll study this in detail and let you know my opinion on the same but
in the meantime, I don't follow one part of this comment: "If they
don't follow this order the worker might process wrong line_size and
leader might populate the information which worker has not yet
processed or in the process of processing."

Do you want to say that leader might overwrite some information which
worker hasn't read yet? If so, it is not clear from the comment.
Another minor point about this comment:

+ * ParallelCopyLineBoundary is common data structure between leader & worker,
+ * Leader process will be populating data block, data block offset &
the size of

I think there should be a full-stop after worker instead of a comma.

>
> > 6.
> > In function BeginParallelCopy(), you need to keep a provision to
> > collect wal_usage and buf_usage stats.  See _bt_begin_parallel for
> > reference.  Those will be required for pg_stat_statements.
> >
>
> Fixed
>

How did you ensure that this is fixed? Have you tested it, if so
please share the test? I see a basic problem with your fix.

+ /* Report WAL/buffer usage during parallel execution */
+ bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
+ walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+   &walusage[ParallelWorkerNumber]);

You need to call InstrStartParallelQuery() before the actual operation
starts, without that the stats won't be accurate. Also, after calling
WaitForParallelWorkersToFinish(), you need to accumulate the stats
collected from the workers, which neither you have done nor is possible
with the current code in your patch because you haven't made any
provision to capture them.
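
For reference, the usual pattern is roughly what _bt_begin_parallel and
_bt_parallel_build_main do (the bufferusage/walusage arrays here are
assumed to be the shm_toc_lookup results above):

	/* In each worker, bracketing the actual load work: */
	InstrStartParallelQuery();
	/* ... worker performs its share of the copy ... */
	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
						  &walusage[ParallelWorkerNumber]);

	/* In the leader, after WaitForParallelWorkersToFinish(): */
	for (int i = 0; i < pcxt->nworkers_launched; i++)
		InstrAccumParallelQuery(&bufferusage[i], &walusage[i]);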

Re: Parallel copy

2020-09-24 Thread Ashutosh Sharma
On Thu, Sep 24, 2020 at 3:00 PM Bharath Rupireddy
 wrote:
>
> >
> > > Have you tested your patch when encoding conversion is needed? If so,
> > > could you please point out the email that has the test results.
> > >
> >
> > We have not yet done encoding testing, we will do and post the results
> > separately in the coming days.
> >
>
> Hi Ashutosh,
>
> I ran the tests ensuring pg_server_to_any() gets called from copy.c. I 
> specified the encoding option of COPY command, with client and server 
> encodings being UTF-8.
>

Thanks Bharath for the testing. The results look impressive.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: Parallel copy

2020-09-24 Thread Bharath Rupireddy
>
> > Have you tested your patch when encoding conversion is needed? If so,
> > could you please point out the email that has the test results.
> >
>
> We have not yet done encoding testing, we will do and post the results
> separately in the coming days.
>

Hi Ashutosh,

I ran the tests ensuring pg_server_to_any() gets called from copy.c. I
specified the encoding option of COPY command, with client and server
encodings being UTF-8.

Tests were performed with a custom postgresql.conf[1], 10 million rows, 5.2GB
of data. The results are triplets of the form (exec time in sec, number of
workers, gain):

Use case 1: 2 indexes on integer columns, 1 index on text column
(1174.395, 0, 1X), (1127.792, 1, 1.04X), (644.260, 2, 1.82X), (341.284, 4,
3.43X), (204.423, 8, 5.74X), (140.692, 16, 8.34X), (129.843, 20, 9.04X),
(134.511, 30, 8.72X)

Use case 2: 1 gist index on text column
(811.412, 0, 1X), (772.203, 1, 1.05X), (437.364, 2, 1.85X), (263.575, 4,
3.08X), (175.135, 8, 4.63X), (155.355, 16, 5.22X), (178.704, 20, 4.54X),
(199.402, 30, 4.06)

Use case 3: 3 indexes on integer columns
(220.680, 0, 1X), (185.096, 1, 1.19X), (134.811, 2, 1.64X), (114.585, 4,
1.92X), (107.707, 8, 2.05X), (101.253, 16, 2.18X), (100.749, 20, 2.19X),
(100.656, 30, 2.19X)

The results are similar to our earlier runs[2].

[1]
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

[2]
https://www.postgresql.org/message-id/CALDaNm13zK%3DJXfZWqZJsm3%2B2yagYDJc%3DeJBgE4i77-4PPNj7vw%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-09-24 Thread Bharath Rupireddy
Thanks Greg for the testing.

On Thu, Sep 24, 2020 at 8:27 AM Greg Nancarrow  wrote:
>
> > 3. Could you please run the test case 3 times at least? Just to ensure
the consistency of the issue.
>
> Yes, have run 4 times. Seems to be a performance hit (whether normal
> copy or parallel-1 copy) on the first COPY run on a freshly created
> database. After that, results are consistent.
>

From the logs, I see that it is happening only with the default
postgresql.conf, and there's inconsistency in the table insertion times,
especially from the 1st run to the 2nd. The variation in table insertion
time is also larger. This is expected with the default postgresql.conf
because of interference from the background processes. That's the reason we
usually run with a custom configuration to correctly measure the performance
gain.

br_default_0_1.log:
2020-09-23 22:32:36.944 JST [112616] LOG:  totaltableinsertiontime =
155068.244 ms
2020-09-23 22:33:57.615 JST [11426] LOG:  totaltableinsertiontime =
42096.275 ms
2020-09-23 22:37:39.192 JST [43097] LOG:  totaltableinsertiontime =
29135.262 ms
2020-09-23 22:38:56.389 JST [54205] LOG:  totaltableinsertiontime =
38953.912 ms
2020-09-23 22:40:27.573 JST [66485] LOG:  totaltableinsertiontime =
27895.326 ms
2020-09-23 22:41:34.948 JST [77523] LOG:  totaltableinsertiontime =
28929.642 ms
2020-09-23 22:43:18.938 JST [89857] LOG:  totaltableinsertiontime =
30625.015 ms
2020-09-23 22:44:21.938 JST [101372] LOG:  totaltableinsertiontime =
24624.045 ms

br_default_1_0.log:
2020-09-24 11:12:14.989 JST [56146] LOG:  totaltableinsertiontime =
192068.350 ms
2020-09-24 11:13:38.228 JST [88455] LOG:  totaltableinsertiontime =
30999.942 ms
2020-09-24 11:15:50.381 JST [108935] LOG:  totaltableinsertiontime =
31673.204 ms
2020-09-24 11:17:14.260 JST [118541] LOG:  totaltableinsertiontime =
31367.027 ms
2020-09-24 11:20:18.975 JST [17270] LOG:  totaltableinsertiontime =
26858.924 ms
2020-09-24 11:22:17.822 JST [26852] LOG:  totaltableinsertiontime =
66531.442 ms
2020-09-24 11:24:09.221 JST [47971] LOG:  totaltableinsertiontime =
38943.384 ms
2020-09-24 11:25:30.955 JST [58849] LOG:  totaltableinsertiontime =
28286.634 ms

br_custom_0_1.log:
2020-09-24 10:29:44.956 JST [110477] LOG:  totaltableinsertiontime =
20207.928 ms
2020-09-24 10:30:49.570 JST [120568] LOG:  totaltableinsertiontime =
23360.006 ms
2020-09-24 10:32:31.659 JST [2753] LOG:  totaltableinsertiontime =
19837.588 ms
2020-09-24 10:35:49.245 JST [31118] LOG:  totaltableinsertiontime =
21759.253 ms
2020-09-24 10:36:54.834 JST [41763] LOG:  totaltableinsertiontime =
23547.323 ms
2020-09-24 10:38:53.507 JST [56779] LOG:  totaltableinsertiontime =
21543.984 ms
2020-09-24 10:39:58.713 JST [67489] LOG:  totaltableinsertiontime =
25254.563 ms

br_custom_1_0.log:
2020-09-24 10:49:03.242 JST [15308] LOG:  totaltableinsertiontime =
16541.201 ms
2020-09-24 10:50:11.848 JST [23324] LOG:  totaltableinsertiontime =
15076.577 ms
2020-09-24 10:51:24.497 JST [35394] LOG:  totaltableinsertiontime =
16400.777 ms
2020-09-24 10:52:32.354 JST [42953] LOG:  totaltableinsertiontime =
15591.051 ms
2020-09-24 10:54:30.327 JST [61136] LOG:  totaltableinsertiontime =
16700.954 ms
2020-09-24 10:55:38.377 JST [68719] LOG:  totaltableinsertiontime =
15435.150 ms
2020-09-24 10:57:08.927 JST [83335] LOG:  totaltableinsertiontime =
17133.251 ms
2020-09-24 10:58:17.420 JST [90905] LOG:  totaltableinsertiontime =
15352.753 ms

>
> Test results show that Parallel COPY with 1 worker is performing
> better than normal COPY in the test scenarios run.
>

Good to know :)

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-09-23 Thread Greg Nancarrow
Hi Bharath,

> Few things to follow before testing:
> 1. Is the table being dropped/truncated after the test with 0 workers and 
> before running with 1 worker? If not, then the index insertion time would 
> increase[1] (for me it increases by 10 sec). This is expected because the 
> first time the index is built bottom-up (from leaves to root), but the 
> second time it has to search for and insert at the proper leaf and inner 
> B+Tree nodes.

Yes, it' being truncated before running each and every COPY.

> 2. If possible, can you also run with custom postgresql.conf settings[2] 
> along with default? Just to ensure that other bg processes such as 
> checkpointer, autovacuum, bgwriter etc. don't affect our testcase. For 
> instance, with default postgresql.conf file, it looks like checkpointing[3] 
> is happening frequently, could you please let us know if that happens at your 
> end?

Yes, have run with default and your custom settings. With default
settings, I can confirm that checkpointing is happening frequently
with the tests I've run here.

> 3. Could you please run the test case 3 times at least? Just to ensure the 
> consistency of the issue.

Yes, have run 4 times. Seems to be a performance hit (whether normal
copy or parallel-1 copy) on the first COPY run on a freshly created
database. After that, results are consistent.

> 4. I ran the tests in a performance test system where no other user 
> processes(except system processes) are running. Is it possible for you to do 
> the same?
>
> Please capture and share the timing logs with us.
>

Yes, I have ensured the system is as idle as possible prior to testing.

I have attached the test results obtained after building with your
Parallel Copy patch and testing patch applied (HEAD at
733fa9aa51c526582f100aa0d375e0eb9a6bce8b).

Test results show that Parallel COPY with 1 worker is performing
better than normal COPY in the test scenarios run. There is a
performance hit (regardless of COPY type) on the very first COPY run
on a freshly-created database.

I ran the test case 4 times. and also in reverse order, with truncate
run before each COPY (output and logs named _0_1 run normal COPY
then parallel COPY, and named _1_0 run parallel COPY and then
normal COPY).

Please refer to attached results.

Regards,
Greg


testing_patch_results.tar.gz
Description: application/gzip


Re: Parallel copy

2020-09-22 Thread Bharath Rupireddy
On Thu, Sep 17, 2020 at 11:06 AM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:
>
> On Wed, Sep 16, 2020 at 1:20 PM Greg Nancarrow 
wrote:
> >
> > Fortunately I have been given permission to share the exact table
> > definition and data I used, so you can check the behaviour and timings
> > on your own test machine.
> >
>
> Thanks Greg for the script. I ran your test case and I didn't observe
> any increase in exec time with 1 worker, indeed, we have benefitted a
> few seconds from 0 to 1 worker as expected.
>
> Execution time is in seconds. Each test case is executed 3 times on
> release build. Each time the data directory is recreated.
>
> Case 1: 100 rows, 2GB
> With Patch, default configuration, 0 worker: 88.933, 92.261, 88.423
> With Patch, default configuration, 1 worker: 73.825, 74.583, 72.678
>
> With Patch, custom configuration, 0 worker: 76.191, 78.160, 78.822
> With Patch, custom configuration, 1 worker: 61.289, 61.288, 60.573
>
> Case 2: 255 rows, 5GB
> With Patch, default configuration, 0 worker: 246.031, 188.323, 216.683
> With Patch, default configuration, 1 worker: 156.299, 153.293, 170.307
>
> With Patch, custom configuration, 0 worker: 197.234, 195.866, 196.049
> With Patch, custom configuration, 1 worker: 157.173, 158.287, 157.090
>

Hi Greg,

If you still observe the issue in your testing environment, I'm attaching a
testing patch (applies on top of the latest parallel copy patch set,
i.e. v5 patches 1 to 6) to capture various timings such as total copy time
in the leader and workers, index and table insertion time, and leader and
worker waiting time. These timings are written to the server log file.

Few things to follow before testing:
1. Is the table being dropped/truncated after the test with 0 workers and
before running with 1 worker? If not, then the index insertion time would
increase[1] (for me it increases by 10 sec). This is expected because the
first time the index is built bottom-up (from leaves to root), but the
second time it has to search for and insert at the proper leaf and inner
B+Tree nodes.
2. If possible, can you also run with custom postgresql.conf settings[2]
along with default? Just to ensure that other bg processes such as
checkpointer, autovacuum, bgwriter etc. don't affect our testcase. For
instance, with default postgresql.conf file, it looks like checkpointing[3]
is happening frequently, could you please let us know if that happens at
your end?
3. Could you please run the test case 3 times at least? Just to ensure the
consistency of the issue.
4. I ran the tests in a performance test system where no other user
processes(except system processes) are running. Is it possible for you to
do the same?

Please capture and share the timing logs with us.

Here's a snapshot of how the added timings show up in the logs (I captured
this with your test case 1: 100 rows, 2GB, custom postgresql.conf
settings[2]).
with 0 workers:
2020-09-22 10:49:27.508 BST [163910] LOG:  totaltableinsertiontime =
24072.034 ms
2020-09-22 10:49:27.508 BST [163910] LOG:  totalindexinsertiontime = 60.682
ms
2020-09-22 10:49:27.508 BST [163910] LOG:  totalcopytime = 59664.594 ms

with 1 worker:
2020-09-22 10:53:58.409 BST [163947] LOG:  totalcopyworkerwaitingtime =
59.815 ms
2020-09-22 10:53:58.409 BST [163947] LOG:  totaltableinsertiontime =
23585.881 ms
2020-09-22 10:53:58.409 BST [163947] LOG:  totalindexinsertiontime = 30.946
ms
2020-09-22 10:53:58.409 BST [163947] LOG:  totalcopytimeworker = 47047.956
ms
2020-09-22 10:53:58.429 BST [163946] LOG:  totalcopyleaderwaitingtime =
26746.744 ms
2020-09-22 10:53:58.429 BST [163946] LOG:  totalcopytime = 47150.002 ms

[1]
0 worker:
LOG:  totaltableinsertiontime = 25491.881 ms
LOG:  totalindexinsertiontime = 14136.104 ms
LOG:  totalcopytime = 75606.858 ms
table is not dropped and so are indexes
1 worker:
LOG:  totalcopyworkerwaitingtime = 64.582 ms
LOG:  totaltableinsertiontime = 21360.875 ms
LOG:  totalindexinsertiontime = 24843.570 ms
LOG:  totalcopytimeworker = 69837.162 ms
LOG:  totalcopyleaderwaitingtime = 49548.441 ms
LOG:  totalcopytime = 69997.778 ms

[2]
custom postgresql.conf configuration:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

[3]
LOG:  checkpoints are occurring too frequently (14 seconds apart)
HINT:  Consider increasing the configuration parameter "max_wal_size".

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v1-0001-Parallel-Copy-Exec-Time-Capture.patch
Description: Binary data


Re: Parallel copy

2020-09-16 Thread Bharath Rupireddy
On Wed, Sep 16, 2020 at 1:20 PM Greg Nancarrow  wrote:
>
> Fortunately I have been given permission to share the exact table
> definition and data I used, so you can check the behaviour and timings
> on your own test machine.
>

Thanks Greg for the script. I ran your test case and I didn't observe
any increase in exec time with 1 worker, indeed, we have benefitted a
few seconds from 0 to 1 worker as expected.

Execution time is in seconds. Each test case is executed 3 times on
release build. Each time the data directory is recreated.

Case 1: 100 rows, 2GB
With Patch, default configuration, 0 worker: 88.933, 92.261, 88.423
With Patch, default configuration, 1 worker: 73.825, 74.583, 72.678

With Patch, custom configuration, 0 worker: 76.191, 78.160, 78.822
With Patch, custom configuration, 1 worker: 61.289, 61.288, 60.573

Case 2: 255 rows, 5GB
With Patch, default configuration, 0 worker: 246.031, 188.323, 216.683
With Patch, default configuration, 1 worker: 156.299, 153.293, 170.307

With Patch, custom configuration, 0 worker: 197.234, 195.866, 196.049
With Patch, custom configuration, 1 worker: 157.173, 158.287, 157.090

[1] - Custom configuration is set up to ensure that no other processes
influence the results. The postgresql.conf used:
shared_buffers = 40GB
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-09-16 Thread Ashutosh Sharma
> > > This is
> > >explained in detail above ParallelCopyLineBoundary.
> >
> > Yes, but prior to that call to pg_atomic_compare_exchange_u32(),
> > aren't you separately reading line_state and line_size, so that
> > between those reads, it may have transitioned from leader to another
> > worker, such that the read line state ("cur_line_state", being checked
> > in the if block) may not actually match what is now in the line_state
> > and/or the read line_size ("dataSize") doesn't actually correspond to
> > the read line state?
> >
> > (sorry, still not 100% convinced that the synchronization and checks
> > are safe in all cases)
> >
>
> I think that you are describing the problem that could happen in the
> following case:
> when we read cur_line_state, the value was LINE_WORKER_PROCESSED or
> LINE_WORKER_PROCESSING. Then, in some cases, if the leader is very fast
> compared to the workers, the leader quickly populates one line and
> sets the state to LINE_LEADER_POPULATED, i.e. the state changes to
> LINE_LEADER_POPULATED while we are checking cur_line_state.
> I feel this will not be a problem because the leader will populate & wait
> till some RING element is available to populate. In the meantime the
> worker has seen that the state is LINE_WORKER_PROCESSED or
> LINE_WORKER_PROCESSING (the previous state that it read), so the worker
> has identified that this chunk was processed by some other worker; the
> worker will move on and try to get the next available chunk & insert
> those records. It will keep continuing till it gets the next chunk to
> process. Eventually one of the workers will get this chunk and process
> it.
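
For reference, the claiming step described above boils down to something
like this minimal sketch (assumed names; line_state is presumably a
pg_atomic_uint32 inside ParallelCopyLineBoundary). Only one caller can win
the LINE_LEADER_POPULATED -> LINE_WORKER_PROCESSING transition:

	uint32		expected = LINE_LEADER_POPULATED;

	if (pg_atomic_compare_exchange_u32(&lineInfo->line_state, &expected,
									   LINE_WORKER_PROCESSING))
	{
		/* We own this line now; line_size is stable and safe to read. */
	}
	else
	{
		/* Another worker won the race; move on to the next chunk. */
	}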
>
> > (3) v3-0006-Parallel-Copy-For-Binary-Format-Files.patch
> >
> > >raw_buf is not used in parallel copy, instead raw_buf will be pointing
> > >to shared memory data blocks. This memory was allocated as part of
> > >BeginCopyFrom, uptil this point we cannot be 100% sure as copy can be
> > >performed sequentially like in case max_worker_processes is not
> > >available, if it switches to sequential mode raw_buf will be used
> > >while performing copy operation. At this place we can safely free this
> > >memory that was allocated
> >
> > So the following code (which checks raw_buf, which still points to
> > memory that has been pfreed) is still valid?
> >
> >   In the SetRawBufForLoad() function, which is called by CopyReadLineText():
> >
> > cur_data_blk_ptr = (cstate->raw_buf) ?
> > &pcshared_info->data_blocks[cur_block_pos] : NULL;
> >
> > The above code looks a bit dicey to me. I stepped over that line in
> > the debugger when I debugged an instance of Parallel Copy, so it
> > definitely gets executed.
> > It makes me wonder what other code could possibly be checking raw_buf
> > and using it in some way, when in fact what it points to has been
> > pfreed.
> >
> > Are you able to add the following line of code, or will it (somehow)
> > break logic that you are relying on?
> >
> > pfree(cstate->raw_buf);
> > cstate->raw_buf = NULL;   <=== I suggest that this line is added
> >
>
> You are right, I have debugged & verified it sets it to an invalid
> block which is not expected. There are chances this would have caused
> some corruption in some machines. The suggested fix is required, I
> have fixed it. I have moved this change to
> 0003-Allow-copy-from-command-to-process-data-from-file.patch as
> 0006-Parallel-Copy-For-Binary-Format-Files is only for Binary format
> parallel copy & that change is common change for parallel copy.
>
> I have attached new set of patches with the fixes.
> Thoughts?
>
> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-09-16 Thread Greg Nancarrow
Hi Bharath,

On Tue, Sep 15, 2020 at 11:49 PM Bharath Rupireddy
 wrote:
>
> Few questions:
>  1. Was the run performed with default postgresql.conf file? If not,
> what are the changed configurations?
Yes, just default settings.

>  2. Are the readings for normal copy(190.891sec, mentioned by you
> above) taken on HEAD or with patch, 0 workers?
With patch

>How much is the runtime
> with your test case on HEAD(Without patch) and 0 workers(With patch)?
TBH, I didn't test that. Looking at the changes, I wouldn't expect a
degradation of performance for normal copy (you have tested, right?).

>  3. Was the run performed on release build?
For generating the perf data I sent (normal copy vs parallel copy with
1 worker), I used a debug build (-g -O0), as that is needed for
generating all the relevant perf data for Postgres code. Previously I
ran with a release build (-O2).

>  4. Were the readings taken on multiple runs(say 3 or 4 times)?
The readings I sent were from just one run (not averaged), but I did
run the tests several times to verify the readings were representative
of the pattern I was seeing.


Fortunately I have been given permission to share the exact table
definition and data I used, so you can check the behaviour and timings
on your own test machine.
Please see the attachment.
You can create the table using the table.sql and index_4.sql
definitions in the "sql" directory.
The data.csv file (to be loaded by COPY) can be created with the
included "dupdata" tool in the "input" directory, which you need to
build, then run, specifying a suitable number of records and path of
the template record (see README). Obviously the larger the number of
records, the larger the file ...
The table can then be loaded using COPY with "format csv" (and
"parallel N" if testing parallel copy).

Regards,
Greg Nancarrow
Fujitsu Australia


Re: Parallel copy

2020-09-15 Thread Bharath Rupireddy
On Fri, Sep 11, 2020 at 3:49 AM Greg Nancarrow  wrote:
>
> I couldn't use the original machine from which I obtained the previous
> results, but ended up using a 4-core CentOS7 VM, which showed a
> similar pattern in the performance results for this test case.
> I obtained the following results from loading a 2GB CSV file (100
> rows, 4 indexes):
>
> Copy Type   Duration (s)   Load factor
> =======================================
> Normal Copy 190.891        -
>
> Parallel Copy
> (#workers)
> 1           210.947        0.90
>
Hi Greg,

I tried to recreate the test case (attached) and I didn't find much
difference with the custom postgresql.conf file.
Test case: 25 tuples, 4 indexes (composite indexes with 10
columns), 3.7GB, 100 columns (as suggested by you, with all the
varchar(255) columns holding 255 characters), exec time in sec.

With custom postgresql.conf[1], removed and recreated the data
directory after every run(I couldn't perform the OS page cache flush
due to some reasons. So, chose this recreation of data dir way, for
testing purpose):
 HEAD: 129.547, 128.624, 128.890
 Patch: 0 workers - 130.213, 131.298, 130.555
 Patch: 1 worker - 127.757, 125.560, 128.275

With default postgresql.conf, removed and recreated the data directory
after every run:
 HEAD: 138.276, 150.472, 153.304
 Patch: 0 workers - 162.468,  149.423, 159.137
 Patch: 1 worker - 136.055, 144.250, 137.916

Few questions:
 1. Was the run performed with default postgresql.conf file? If not,
what are the changed configurations?
 2. Are the readings for normal copy(190.891sec, mentioned by you
above) taken on HEAD or with patch, 0 workers? How much is the runtime
with your test case on HEAD(Without patch) and 0 workers(With patch)?
 3. Was the run performed on release build?
 4. Were the readings taken on multiple runs(say 3 or 4 times)?

[1] - Postgres configuration used for above testing:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
 DROP TABLE IF EXISTS t100_1;
 CREATE TABLE t100_1 (
 c1_s_1 bigserial,
 c2_vc_1 varchar(255),
 c3_d_1 date,
 c4_n_1 numeric,
 c5_vc_2 varchar(255),
 c6_d_2 date,
 c7_vc_3 varchar(255),
 c8_t_1 time without time zone,
 c9_vc_4 varchar(255),
 c10_vc_5 varchar(255),
 c11_t_2 time without time zone,
 c12_vc_6 varchar(255),
 c13_vc_7 varchar(255),
 c14_n_2 numeric,
 c15_d_3 date,
 c16_t_3 time without time zone,
 c17_n_3 numeric,
 c18_vc_8 varchar(255),
 c19_d_4 date,
 c20_vc_9 varchar(255),
 c21_t_4 time without time zone,
 c22_vc_10 varchar(255),
 c23_d_5 date,
 c24_vc_11 varchar(255),
 c25_t_5 time without time zone,
 c26_t_6 time without time zone,
 c27_vc_12 varchar(255),
 c28_vc_13 varchar(255),
 c29_d_6 date,
 c30_vc_14 varchar(255),
 c31_vc_15 varchar(255),
 c32_t_7 time without time zone,
 c33_t_8 time without time zone,
 c34_vc_16 varchar(255),
 c35_vc_17 varchar(255),
 c36_d_7 date,
 c37_d_8 date,
 c38_vc_18 varchar(255),
 c39_t_9 time without time zone,
 c40_vc_19 varchar(255),
 c41_vc_20 varchar(255),
 c42_vc_21 varchar(255),
 c43_vc_22 varchar(255),
 c44_t_10 time without time zone,
 c45_d_9 date,
 c46_vc_23 varchar(255),
 c47_t_11 time without time zone,
 c48_vc_24 varchar(255),
 c49_vc_25 varchar(255),
 c50_vc_26 varchar(255),
 c51_d_10 date,
 c52_vc_27 varchar(255),
 c53_vc_28 varchar(255),
 c54_vc_29 varchar(255),
 c55_vc_30 varchar(255),
 c56_d_11 date,
 c57_vc_31 varchar(255),
 c58_vc_32 varchar(255),
 c59_vc_33 varchar(255),
 c60_vc_34 varchar(255),
 c61_vc_35 varchar(255),
 c62_vc_36 varchar(255),
 c63_vc_37 varchar(255),
 c64_vc_38 varchar(255),
 c65_vc_39 varchar(255),
 c66_n_4 numeric,
 c67_vc_40 varchar(255),
 c68_vc_41 varchar(255),
 c69_vc_42 varchar(255),
 c70_vc_43 varchar(255),
 c71_n_5 numeric,
 c72_vc_44 varchar(255),
 c73_n_6 numeric,
 c74_vc_45 varchar(255),
 c75_vc_46 varchar(255),
 c76_vc_47 varchar(255),
 c77_n_7 numeric,
 c78_n_8 numeric,
 c79_d_12 date,
 c80_n_9 numeric,
 c81_vc_48 varchar(255),
 c82_vc_49 varchar(255),
 c83_vc_50 varchar(255),
 c84_n_10 numeric,
 c85_vc_51 varchar(255),
 c86_vc_52 varchar(255),
 c87_vc_53 varchar(255),
 c88_d_13 date,
 c89_n_13 numeric,
 c90_vc_54 varchar(255),
 c91_vc_55 varchar(255),
 c92_n_11 numeric

Re: Parallel copy

2020-09-07 Thread vignesh C
On Tue, Sep 1, 2020 at 3:39 PM Greg Nancarrow  wrote:
>
> Hi Vignesh,
>
> >Can you share with me the script you used to generate the data & the ddl of 
> >the table, so that it will help me check that >scenario you faced the 
> >>problem.
>
> Unfortunately I can't directly share it (considered company IP),
> though having said that it's only doing something that is relatively
> simple and unremarkable, so I'd expect it to be much like what you are
> currently doing. I can describe it in general.
>
> The table being used contains 100 columns (as I pointed out earlier),
> with the first column of "bigserial" type, and the others of different
> types like "character varying(255)", "numeric", "date" and "time
> without timezone". There's about 60 of the "character varying(255)"
> overall, with the other types interspersed.
>

Thanks Greg for executing & sharing the results.
I tried with a similar test case that you suggested, I was not able to
reproduce the degradation scenario.
If it is possible, can you run perf for the scenario with 1 worker &
non parallel mode & share the perf results, we will be able to find
out which of the functions is consuming more time by doing a
comparison of the perf reports.
Steps for running perf:
1) get the postgres pid
2) perf record -a -g -p <pid>
3) Run copy command
4) Execute "perf report -g" once copy finishes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-09-03 Thread Greg Nancarrow
>On Wed, Sep 2, 2020 at 3:40 PM vignesh C  wrote:
> I have attached the scripts that I used for the test results I
> mentioned in my previous mail. create.sql file has the table that I
> used, insert_data_gen.txt has the insert data generation scripts. I
> varied the count in insert_data_gen to generate csv files of 1GB, 2GB
> & 5GB & varied the data to generate 1 char, 10 char & 100 char for
> each column for various testing. You can rename insert_data_gen.txt to
> insert_data_gen.sh & generate the csv file.


Hi Vignesh,

I used your script and table definition, multiplying the number of
records to produce a 5GB and 9.5GB CSV file.
I got the following results:


(1) Postgres default settings, 5GB CSV (53 rows):

Copy Type   Duration (s)   Load factor
=======================================
Normal Copy 132.197        -

Parallel Copy
(#workers)
1           98.428         1.34
2           52.753         2.51
3           37.630         3.51
4           33.554         3.94
5           33.636         3.93
6           33.821         3.91
7           34.270         3.86
8           34.465         3.84
9           34.315         3.85
10          33.543         3.94


(2) Postgres increased resources, 5GB CSV (53 rows):

shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size=2GB


Copy Type   Duration (s)   Load factor
=======================================
Normal Copy 131.835        -

Parallel Copy
(#workers)
1           98.301         1.34
2           53.261         2.48
3           37.868         3.48
4           34.224         3.85
5           33.831         3.90
6           34.229         3.85
7           34.512         3.82
8           34.303         3.84
9           34.690         3.80
10          34.479         3.82



(3) Postgres default settings, 9.5GB CSV (100 rows):

Copy Type   Duration (s)   Load factor
=======================================
Normal Copy 248.503        -

Parallel Copy
(#workers)
1           185.724        1.34
2           99.832         2.49
3           70.560         3.52
4           63.328         3.92
5           63.182         3.93
6           64.108         3.88
7           64.131         3.87
8           64.350         3.86
9           64.293         3.87
10          63.818         3.89


(4) Postgres increased resources, 9.5GB CSV (100 rows):

shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size=2GB


Copy Type   Duration (s)   Load factor
=======================================
Normal Copy 248.647        -

Parallel Copy
(#workers)
1           182.236        1.36
2           92.814         2.68
3           67.347         3.69
4           63.839         3.89
5           62.672         3.97
6           63.873         3.89
7           64.930         3.83
8           63.885         3.89
9           62.397         3.98
10          64.477         3.86



So as you found, with this particular table definition and data, 1
parallel worker always performs better than normal copy.
The different result obtained for this particular case seems to be
caused by the following factors:
- different table definition (I used a variety of column types)
- amount of data per row (I used less data per row, so more rows per
same size data file)

As I previously observed, if the target table has no indexes,
increasing resources beyond the default settings makes little
difference to the performance.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Parallel copy

2020-09-01 Thread vignesh C
On Tue, Sep 1, 2020 at 3:39 PM Greg Nancarrow  wrote:
>
> Hi Vignesh,
>
> > Can you share with me the script you used to generate the data & the DDL
> > of the table, so that it will help me check the scenario where you faced
> > the problem.
>
> Unfortunately I can't directly share it (considered company IP),
> though having said that it's only doing something that is relatively
> simple and unremarkable, so I'd expect it to be much like what you are
> currently doing. I can describe it in general.
>
> The table being used contains 100 columns (as I pointed out earlier),
> with the first column of "bigserial" type, and the others of different
> types like "character varying(255)", "numeric", "date" and "time
> without timezone". There's about 60 of the "character varying(255)"
> overall, with the other types interspersed.
>
> When testing with indexes, 4 b-tree indexes were used that each
> included the first column and then distinctly 9 other columns.
>
> A CSV record (row) template file was created with test data
> (corresponding to the table), and that was simply copied and appended
> over and over with a record prefix in order to create the test data
> file.
> The following shell-script basically does it (but very slowly). I was
> using a small C program to do similar, a lot faster.
> In my case, N=255 produced about a 5GB CSV file.
>
> file_out=data.csv; for i in {1..N}; do echo -n "$i," >> $file_out;
> cat sample_record.csv >> $file_out; done
>
> One other thing I should mention is that between each test run, I
> cleared the OS page cache, as described here:
> https://linuxhint.com/clear_cache_linux/
> That way, each COPY FROM is not taking advantage of any OS-cached data
> from a previous COPY FROM.

I will try with a similar test and check if I can reproduce.

> If your data is somehow significantly different and you want to (and
> can) share your script, then I can try it in my environment.

I have attached the scripts that I used for the test results I
mentioned in my previous mail. create.sql file has the table that I
used, insert_data_gen.txt has the insert data generation scripts. I
varied the count in insert_data_gen to generate csv files of 1GB, 2GB
& 5GB & varied the data to generate 1 char, 10 char & 100 char for
each column for various testing. You can rename insert_data_gen.txt to
insert_data_gen.sh & generate the csv file.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
#!/bin/bash
for i in {1..10}
do
   echo 

Re: Parallel copy

2020-09-01 Thread Greg Nancarrow
Hi Vignesh,

> Can you share with me the script you used to generate the data & the DDL of
> the table, so that it will help me check the scenario where you faced the
> problem.

Unfortunately I can't directly share it (considered company IP),
though having said that it's only doing something that is relatively
simple and unremarkable, so I'd expect it to be much like what you are
currently doing. I can describe it in general.

The table being used contains 100 columns (as I pointed out earlier),
with the first column of "bigserial" type, and the others of different
types like "character varying(255)", "numeric", "date" and "time
without timezone". There's about 60 of the "character varying(255)"
overall, with the other types interspersed.

When testing with indexes, 4 b-tree indexes were used that each
included the first column and then distinctly 9 other columns.

A CSV record (row) template file was created with test data
(corresponding to the table), and that was simply copied and appended
over and over with a record prefix in order to create the test data
file.
The following shell-script basically does it (but very slowly). I was
using a small C program to do similar, a lot faster.
In my case, N=255 produced about a 5GB CSV file.

file_out=data.csv; for i in {1..N}; do echo -n "$i," >> $file_out;
cat sample_record.csv >> $file_out; done
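
The small C program I mentioned is essentially the same idea; a minimal
sketch of it (hypothetical, no error handling, and it assumes the template
record fits in the buffer) would be:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    long    nrecs = (argc > 1) ? atol(argv[1]) : 1000000;
    FILE   *in = fopen("sample_record.csv", "rb");
    FILE   *out = fopen("data.csv", "wb");
    char    tmpl[65536];
    size_t  len = fread(tmpl, 1, sizeof(tmpl), in);

    for (long i = 1; i <= nrecs; i++)
    {
        fprintf(out, "%ld,", i);    /* record prefix (serial column) */
        fwrite(tmpl, 1, len, out);  /* the rest of the template record */
    }
    fclose(in);
    fclose(out);
    return 0;
}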

One other thing I should mention is that between each test run, I
cleared the OS page cache, as described here:
https://linuxhint.com/clear_cache_linux/
That way, each COPY FROM is not taking advantage of any OS-cached data
from a previous COPY FROM.

If your data is somehow significantly different and you want to (and
can) share your script, then I can try it in my environment.


Regards,
Greg




Re: Parallel copy

2020-08-31 Thread vignesh C
On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow  wrote:
> - Parallel Copy with 1 worker ran slower than normal Copy in a couple
> of cases (I did question if allowing 1 worker was useful in my patch
> review).

Thanks Greg for your review & testing.
I had executed various tests with 1GB, 2GB & 5GB csv files with 100 columns,
without parallel mode & with 1 parallel worker. The test results are as
given below:
Test                               Without parallel mode   With 1 parallel worker
1GB csv file, 100 columns
(100 bytes data in each column)    62 seconds              47 seconds (1.32X)
1GB csv file, 100 columns
(1000 bytes data in each column)   89 seconds              78 seconds (1.14X)
2GB csv file, 100 columns
(1 byte data in each column)       277 seconds             256 seconds (1.08X)
5GB csv file, 100 columns
(100 byte data in each column)     515 seconds             445 seconds (1.16X)
I have run the tests multiple times and have noticed similar execution
times in all the runs for the above tests.
In the above results there is a slight improvement with 1 worker. In my
tests I did not observe the degradation for copy with 1 worker compared to
non-parallel copy. Can you share the script you used to generate the data &
the DDL of the table, so that it will help me check the scenario where you
faced the problem?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel copy

2020-08-27 Thread Amit Kapila
On Thu, Aug 27, 2020 at 4:56 PM vignesh C  wrote:
>
> On Thu, Aug 27, 2020 at 8:24 AM Amit Kapila  wrote:
> >
> > On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow  wrote:
> > >
> > > > I have attached new set of patches with the fixes.
> > > > Thoughts?
> > >
> > > Hi Vignesh,
> > >
> > > I don't really have any further comments on the code, but would like
> > > to share some results of some Parallel Copy performance tests I ran
> > > (attached).
> > >
> > > The tests loaded a 5GB CSV data file into a 100 column table (of
> > > different data types). The following were varied as part of the test:
> > > - Number of workers (1 – 10)
> > > - No indexes / 4-indexes
> > > - Default settings / increased resources (shared_buffers,work_mem, etc.)
> > >
> > > (I did not do any partition-related tests as I believe those type of
> > > tests were previously performed)
> > >
> > > I built Postgres (latest OSS code) with the latest Parallel Copy patches 
> > > (v4).
> > > The test system was a 32-core Intel Xeon E5-4650 server with 378GB of RAM.
> > >
> > >
> > > I observed the following trends:
> > > - For the data file size used, Parallel Copy achieved best performance
> > > using about 9 – 10 workers. Larger data files may benefit from using
> > > more workers. However, I couldn’t really see any better performance,
> > > for example, from using 16 workers on a 10GB CSV data file compared to
> > > using 8 workers. Results may also vary depending on machine
> > > characteristics.
> > > - Parallel Copy with 1 worker ran slower than normal Copy in a couple
> > > of cases (I did question if allowing 1 worker was useful in my patch
> > > review).
> >
> > I think the reason is that for 1 worker case there is not much
> > parallelization as a leader doesn't perform the actual load work.
> > Vignesh, can you please once see if the results are reproducible at
> > your end, if so, we can once compare the perf profiles to see why in
> > some cases we get improvement and in other cases not. Based on that we
> > can decide whether to allow the 1 worker case or not.
> >
>
> I will spend some time on this and update.
>

Thanks.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-08-27 Thread vignesh C
On Thu, Aug 27, 2020 at 8:24 AM Amit Kapila  wrote:
>
> On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow  wrote:
> >
> > > I have attached new set of patches with the fixes.
> > > Thoughts?
> >
> > Hi Vignesh,
> >
> > I don't really have any further comments on the code, but would like
> > to share some results of some Parallel Copy performance tests I ran
> > (attached).
> >
> > The tests loaded a 5GB CSV data file into a 100 column table (of
> > different data types). The following were varied as part of the test:
> > - Number of workers (1 – 10)
> > - No indexes / 4-indexes
> > - Default settings / increased resources (shared_buffers,work_mem, etc.)
> >
> > (I did not do any partition-related tests as I believe those type of
> > tests were previously performed)
> >
> > I built Postgres (latest OSS code) with the latest Parallel Copy patches 
> > (v4).
> > The test system was a 32-core Intel Xeon E5-4650 server with 378GB of RAM.
> >
> >
> > I observed the following trends:
> > - For the data file size used, Parallel Copy achieved best performance
> > using about 9 – 10 workers. Larger data files may benefit from using
> > more workers. However, I couldn’t really see any better performance,
> > for example, from using 16 workers on a 10GB CSV data file compared to
> > using 8 workers. Results may also vary depending on machine
> > characteristics.
> > - Parallel Copy with 1 worker ran slower than normal Copy in a couple
> > of cases (I did question if allowing 1 worker was useful in my patch
> > review).
>
> I think the reason is that for 1 worker case there is not much
> parallelization as a leader doesn't perform the actual load work.
> Vignesh, can you please once see if the results are reproducible at
> your end, if so, we can once compare the perf profiles to see why in
> some cases we get improvement and in other cases not. Based on that we
> can decide whether to allow the 1 worker case or not.
>

I will spend some time on this and update.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-08-26 Thread Amit Kapila
On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow  wrote:
>
> > I have attached new set of patches with the fixes.
> > Thoughts?
>
> Hi Vignesh,
>
> I don't really have any further comments on the code, but would like
> to share some results of some Parallel Copy performance tests I ran
> (attached).
>
> The tests loaded a 5GB CSV data file into a 100 column table (of
> different data types). The following were varied as part of the test:
> - Number of workers (1 – 10)
> - No indexes / 4-indexes
> - Default settings / increased resources (shared_buffers,work_mem, etc.)
>
> (I did not do any partition-related tests as I believe those type of
> tests were previously performed)
>
> I built Postgres (latest OSS code) with the latest Parallel Copy patches (v4).
> The test system was a 32-core Intel Xeon E5-4650 server with 378GB of RAM.
>
>
> I observed the following trends:
> - For the data file size used, Parallel Copy achieved best performance
> using about 9 – 10 workers. Larger data files may benefit from using
> more workers. However, I couldn’t really see any better performance,
> for example, from using 16 workers on a 10GB CSV data file compared to
> using 8 workers. Results may also vary depending on machine
> characteristics.
> - Parallel Copy with 1 worker ran slower than normal Copy in a couple
> of cases (I did question if allowing 1 worker was useful in my patch
> review).

I think the reason is that for 1 worker case there is not much
parallelization as a leader doesn't perform the actual load work.
Vignesh, can you please once see if the results are reproducible at
your end, if so, we can once compare the perf profiles to see why in
some cases we get improvement and in other cases not. Based on that we
can decide whether to allow the 1 worker case or not.

> - Typical load time improvement (load factor) for Parallel Copy was
> between 2x and 3x. Better load factors can be obtained by using larger
> data files and/or more indexes.
>

Nice improvement and I think you are right that with larger load data
we will get even better improvement.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-08-26 Thread Greg Nancarrow
> I have attached new set of patches with the fixes.
> Thoughts?

Hi Vignesh,

I don't really have any further comments on the code, but would like
to share some results of some Parallel Copy performance tests I ran
(attached).

The tests loaded a 5GB CSV data file into a 100 column table (of
different data types). The following were varied as part of the test:
- Number of workers (1 – 10)
- No indexes / 4-indexes
- Default settings / increased resources (shared_buffers,work_mem, etc.)

(I did not do any partition-related tests as I believe those type of
tests were previously performed)

I built Postgres (latest OSS code) with the latest Parallel Copy patches (v4).
The test system was a 32-core Intel Xeon E5-4650 server with 378GB of RAM.


I observed the following trends:
- For the data file size used, Parallel Copy achieved best performance
using about 9 – 10 workers. Larger data files may benefit from using
more workers. However, I couldn’t really see any better performance,
for example, from using 16 workers on a 10GB CSV data file compared to
using 8 workers. Results may also vary depending on machine
characteristics.
- Parallel Copy with 1 worker ran slower than normal Copy in a couple
of cases (I did question if allowing 1 worker was useful in my patch
review).
- Typical load time improvement (load factor) for Parallel Copy was
between 2x and 3x. Better load factors can be obtained by using larger
data files and/or more indexes.
- Increasing Postgres resources made little or no difference to
Parallel Copy performance when the target table had no indexes.
Increasing Postgres resources improved Parallel Copy performance when
the target table had indexes.

Regards,
Greg Nancarrow
Fujitsu Australia
(1) Postgres default settings, 5GB CSV (510 rows), no indexes on table:

Copy Type       Duration (s)    Load factor
===========================================
Normal Copy     132.838         -

Parallel Copy
(#workers)
 1              97.537          1.36
 2              61.700          2.15
 3              52.788          2.52
 4              46.607          2.85
 5              45.524          2.92
 6              43.799          3.03
 7              42.970          3.09
 8              42.974          3.09
 9              43.698          3.04
10              43.362          3.06


(2) Postgres default settings, 5GB CSV (510 rows), 4 indexes on table:

Copy Type       Duration (s)    Load factor
===========================================
Normal Copy     221.111         -

Parallel Copy
(#workers)
 1              331.609         0.66
 2              99.085          2.23
 3              89.751          2.46
 4              81.137          2.73
 5              79.138          2.79
 6              77.155          2.87
 7              75.813          2.92
 8              74.961          2.95
 9              77.803          2.84
10              75.399          2.93


(3) Postgres increased resources, 5GB CSV (510 rows), no indexes on table:

shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size = 2GB


Copy Type       Duration (s)    Load factor
===========================================
Normal Copy     78.138          -

Parallel Copy
(#workers)
 1              95.203          0.82
 2              62.596          1.24
 3              52.318          1.49
 4              48.246          1.62
 5              42.832          1.82
 6              42.921          1.82
 7              43.146          1.81
 8              41.557          1.88
 9              43.489          1.80
10              43.362          1.80


(4) Postgres increased resources, 5GB CSV (510 rows), 4 indexes on table:

Copy Type       Duration (s)    Load factor

Re: Parallel copy

2020-08-16 Thread Greg Nancarrow
Hi Vignesh,

Some further comments:

(1) v3-0002-Framework-for-leader-worker-in-parallel-copy.patch

+/*
+ * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
+ * block to process to avoid lock contention. This value should be divisible by
+ * RINGSIZE, as wrap around cases is currently not handled while selecting the
+ * WORKER_CHUNK_COUNT by the worker.
+ */
+#define WORKER_CHUNK_COUNT 50


"This value should be divisible by RINGSIZE" is not a correct
statement (since obviously 50 is not divisible by 1).
It should say something like "This value should evenly divide into
RINGSIZE", or "RINGSIZE should be a multiple of WORKER_CHUNK_COUNT".


(2) v3-0003-Allow-copy-from-command-to-process-data-from-file.patch

(i)

+   /*
+    * If the data is present in current block lineInfo. line_size
+    * will be updated. If the data is spread across the blocks either

Somehow a space has been put between "lineinfo." and "line_size".
It should be: "If the data is present in current block
lineInfo.line_size will be updated"

(ii)

>This is not possible because of pg_atomic_compare_exchange_u32, this
>will succeed only for one of the workers whose line_state is
>LINE_LEADER_POPULATED, for other workers it will fail. This is
>explained in detail above ParallelCopyLineBoundary.

Yes, but prior to that call to pg_atomic_compare_exchange_u32(),
aren't you separately reading line_state and line_size, so that
between those reads the line may have transitioned from leader to
another worker? In that case the read line state ("cur_line_state",
being checked in the if block) may not actually match what is now in
line_state, and/or the read line_size ("dataSize") may not correspond
to the read line state.

(sorry, still not 100% convinced that the synchronization and checks
are safe in all cases)
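
To make the concern concrete, the sequence in question looks roughly
like this (a sketch only; the state names follow the patch, and
ProcessClaimedLine() is a hypothetical stand-in for the worker's actual
processing):

    uint32  cur_line_state = pg_atomic_read_u32(&lineInfo->line_state);
    uint32  dataSize = pg_atomic_read_u32(&lineInfo->line_size);

    /* <-- window: the leader may update line_state/line_size here */

    if (cur_line_state == LINE_LEADER_POPULATED && dataSize > 0)
    {
        uint32      expected = LINE_LEADER_POPULATED;

        /*
         * The CAS guarantees that only one worker claims the line, but
         * it does not re-validate that dataSize still matches the
         * state it observed -- which is the gap being questioned.
         */
        if (pg_atomic_compare_exchange_u32(&lineInfo->line_state,
                                           &expected,
                                           LINE_WORKER_PROCESSING))
            ProcessClaimedLine(lineInfo, dataSize);
    }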

(3) v3-0006-Parallel-Copy-For-Binary-Format-Files.patch

>raw_buf is not used in parallel copy, instead raw_buf will be pointing
>to shared memory data blocks. This memory was allocated as part of
>BeginCopyFrom, uptil this point we cannot be 100% sure as copy can be
>performed sequentially like in case max_worker_processes is not
>available, if it switches to sequential mode raw_buf will be used
>while performing copy operation. At this place we can safely free this
>memory that was allocated

So the following code (which checks raw_buf, which still points to
memory that has been pfreed) is still valid?

  In the SetRawBufForLoad() function, which is called by CopyReadLineText():

cur_data_blk_ptr = (cstate->raw_buf) ?
&pcshared_info->data_blocks[cur_block_pos] : NULL;

The above code looks a bit dicey to me. I stepped over that line in
the debugger when I debugged an instance of Parallel Copy, so it
definitely gets executed.
It makes me wonder what other code could possibly be checking raw_buf
and using it in some way, when in fact what it points to has been
pfreed.

Are you able to add the following line of code, or will it (somehow)
break logic that you are relying on?

pfree(cstate->raw_buf);
cstate->raw_buf = NULL;   <=== I suggest that this line is added
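
With that change in place, any remaining (cstate->raw_buf) tests, such
as the one in SetRawBufForLoad() above, would take the NULL branch, so
the parallel case would have to be detected some other way. A sketch of
what that might look like (IsParallelCopy() stands for whatever
predicate the patch already uses to distinguish the two modes):

    /* Decide from the copy mode, not from a pointer whose target may
     * have been pfreed: */
    cur_data_blk_ptr = IsParallelCopy() ?
        &pcshared_info->data_blocks[cur_block_pos] : NULL;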

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Parallel copy

2020-08-11 Thread Greg Nancarrow
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, failed

Hi,

I don't claim to yet understand all of the Postgres internals that this patch 
is updating and interacting with, so I'm still testing and debugging portions 
of this patch, but would like to give feedback on what I've noticed so far.
I have done some ad-hoc testing of the patch using parallel copies from 
text/csv/binary files and have not yet struck any execution problems other than 
some option validation and associated error messages on boundary cases.

One general question that I have: is there a user benefit (over the normal 
non-parallel COPY) to allowing "COPY ... FROM ... WITH (PARALLEL 1)"?


My following comments are broken down by patch:

(1) v2-0001-Copy-code-readjustment-to-support-parallel-copy.patch

(i) Whilst I can't entirely blame these patches for it (as they are following 
what is already there), I can't help noticing the use of numerous macros in 
src/backend/commands/copy.c which paste in multiple lines of code in various 
places.
It's getting a little out of hand. Surely the majority of these would
be better as inline functions? Perhaps this hasn't been done because
too many parameters would need to be passed - thoughts?
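
For example (the macro and fields here are invented for illustration;
this is not one of copy.c's actual macros):

    /* Current style: a multi-line statement-pasting macro. */
    #define INCREMENT_PROCESSED(cstate) \
    do { \
        (cstate)->cur_lineno++; \
        (cstate)->processed++; \
    } while (0)

    /* Inline-function equivalent: type-checked and debuggable, and the
     * compiler will still inline it, so there is no runtime cost. */
    static inline void
    increment_processed(CopyState cstate)
    {
        cstate->cur_lineno++;
        cstate->processed++;
    }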


(2) v2-0002-Framework-for-leader-worker-in-parallel-copy.patch

(i) minor point: there are some tabbing/spacing issues in this patch (and the 
other patches), affecting alignment.
e.g. mixed tabs/spaces and misalignment in PARALLEL_COPY_KEY_xxx definitions

(ii)

+/*
+ * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
+ * block to process to avoid lock contention. This value should be mode of
+ * RINGSIZE, as wrap around cases is currently not handled while selecting the
+ * WORKER_CHUNK_COUNT by the worker.
+ */
+#define WORKER_CHUNK_COUNT 50


"This value should be mode of RINGSIZE ..."

-> typo: mode  (mod?  should evenly divide into RINGSIZE?)


(iii)
+ *using pg_atomic_compare_exchange_u32, worker will change the sate to

->typo: sate  (should be "state")


(iv)

+errmsg("parallel option 
supported only for copy from"),

-> suggest change to:   errmsg("parallel option is supported only for 
COPY FROM"),

(v)

+   errno = 0;  /* To distinguish success/failure after call */
+   val = strtol(str, &endptr, 10);
+
+   /* Check for various possible errors */
+   if ((errno == ERANGE && (val == LONG_MAX || val == LONG_MIN))
+       || (errno != 0 && val == 0) ||
+       *endptr)
+       ereport(ERROR,
+               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                errmsg("improper use of argument to option \"%s\"",
+                       defel->defname),
+                parser_errposition(pstate, defel->location)));
+
+   if (endptr == str)
+       ereport(ERROR,
+               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                errmsg("no digits were found in argument to option \"%s\"",
+                       defel->defname),
+                parser_errposition(pstate, defel->location)));
+
+   cstate->nworkers = (int) val;
+
+   if (cstate->nworkers <= 0)
+       ereport(ERROR,
+               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                errmsg("argument to option \"%s\" must be a positive integer greater than zero",
+                       defel->defname),
+                parser_errposition(pstate, defel->location)));


I think this validation code needs to be improved, including the error
messages (e.g. when can a "positive integer" NOT be greater than zero?).

There is some overlap in the "no digits were found" case between the two
conditions above, depending, for example, on whether the argument is quoted.
Also, "improper use of argument to option" sounds a bit odd and vague to me.
Finally, not range-checking the long value before casting it to int can
allow out-of-range int values, as in the following case:

test=# copy mytable from '/myspace/test_pc
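
Something along the following lines would address all of those points
in one place (a sketch only; MAX_PARALLEL_WORKER_LIMIT is used as an
assumed upper bound -- any existing limit would do):

    char       *endptr;
    long        val;

    errno = 0;
    val = strtol(str, &endptr, 10);

    /* One test catches empty input, trailing junk, overflow, and
     * values outside the range acceptable as an int: */
    if (endptr == str || *endptr != '\0' || errno == ERANGE ||
        val < 1 || val > MAX_PARALLEL_WORKER_LIMIT)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("value for option \"%s\" must be an integer between 1 and %d",
                        defel->defname, MAX_PARALLEL_WORKER_LIMIT),
                 parser_errposition(pstate, defel->location)));

    cstate->nworkers = (int) val;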

Re: Parallel copy

2020-08-05 Thread vignesh C
On Tue, Aug 4, 2020 at 9:51 PM Tomas Vondra
 wrote:
>
> On Mon, Aug 03, 2020 at 12:33:48PM +0530, Bharath Rupireddy wrote:
> >On Sat, Aug 1, 2020 at 9:55 AM vignesh C  wrote:
> >>
> >> The patches were not applying because of the recent commits.
> >> I have rebased the patch over head & attached.
> >>
> >I rebased v2-0006-Parallel-Copy-For-Binary-Format-Files.patch.
> >
> >Putting together all the patches rebased on to the latest commit
> >b8fdee7d0ca8bd2165d46fb1468f75571b706a01. Patches from 0001 to 0005
> >are rebased by Vignesh, that are from the previous mail and the patch
> >0006 is rebased by me.
> >
> >Please consider this patch set for further review.
> >
>
> I'd suggest incrementing the version every time an updated version is
> submitted, even if it's just a rebased version. It makes it clearer
> which version of the code is being discussed, etc.

Sure, we will take care of this when we send the next set of patches.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



