Re: [HACKERS] Parallel COPY FROM execution
Greetings pgsql-hackers,

I have completed the general infrastructure for parallel COPY FROM execution, consisting of a main (master) process and multiple BGWorkers, each connected to the master via a personal message queue (shm_mq).

The master process does:
- Dynamic shared memory allocation holding the parallel state shared across the BGWorkers and the master
- Attaching every worker to its personal message queue (shm_mq)
- Waiting for worker initialization using a Latch
- Reading raw text lines using CopyReadLine and putting them into the shm_mq's in round-robin fashion to balance the load across the queues
- When EOF is reached, sending a zero-length message; workers shut down safely when they receive it
- Waiting for the workers to complete their jobs using a ConditionVariable

Each BGWorker does:
- Signal the master on initialization via a Latch
- Receive raw text lines over its personal shm_mq and put them into the log (for now)
- Reinitialize its database connection using the same db_id and user_id as the main process
- Signal the master via the ConditionVariable when its job is done

All parallel state modifications are done under LWLocks. You can find the actual code here: https://github.com/ololobus/postgres/pull/2/files (it is still in progress, so it has a lot of duplication and comments which are to be deleted).

To go forward I have to overcome some obstacles:

- Currently, all copy.c methods are designed to work with one giant structure, CopyState. It includes buffers, many initial parameters which stay unchanged, and a few variables which vary during COPY FROM execution. Since I need all these parameters, I have to obtain them somehow inside each BGWorker process. I see two possible solutions here:

1) Perform the BeginCopyFrom initialization inside the master and put the required parameters into shared memory. However, many of them are arrays of variable size (e.g. partition_tupconv_maps, force_notnull_flags), so I cannot put them into shmem as one single struct. The best idea I have is to put each parameter under its own shmem TOC key, which seems quite ugly.

2) Perform the BeginCopyFrom initialization inside each BGWorker. However, it also opens the input file/pipe for reading, which is not necessary in the workers and may cause some interference with the master; I can modify BeginCopyFrom to avoid that.

- I have used both a Latch and a ConditionVariable for the same purpose (waiting until some signal occurs), and to me as an end user they behave quite similarly. I looked into the condition_variable.c code and it uses a Latch and a SpinLock under the hood. So what are the differences between Latch and ConditionVariable, and the dis-/advantages of each?

I will be glad if someone can help me find an answer to this question; any comments and remarks on the overall COPY FROM processing architecture are also very welcome.

Alexey

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
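The round-robin distribution and the zero-length shutdown message described above can be simulated outside the backend. Here is a minimal Python sketch of just that logic (plain in-process deques stand in for the shm_mq's; all names are hypothetical, not the patch's actual code):

```python
from collections import deque

def distribute(lines, nworkers):
    """Round-robin raw lines across per-worker queues, then append
    a zero-length sentinel to each queue (the shutdown signal)."""
    queues = [deque() for _ in range(nworkers)]
    for i, line in enumerate(lines):
        queues[i % nworkers].append(line)   # balance load across queues
    for q in queues:
        q.append("")                        # EOF: zero-length message
    return queues

def worker(q):
    """Consume lines until the zero-length sentinel arrives."""
    processed = []
    while True:
        msg = q.popleft()
        if msg == "":                       # sentinel: shut down safely
            break
        processed.append(msg)
    return processed
```

The same shape should hold with real shm_mq's: the sender never needs to know which line goes to which worker, and workers need no extra shutdown signaling beyond the empty message.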
Re: [HACKERS] Parallel COPY FROM execution
On Fri, Jun 30, 2017 at 3:35 PM, Pavel Stehule <pavel.steh...@gmail.com> wrote:
>
> 2017-06-30 14:23 GMT+02:00 Alex K <kondratov.alek...@gmail.com>:
>>
>> Thus, it results in a ~60% performance boost per each x2 multiplication of
>> parallel processes, which is consistent with the initial estimation.
>
> the important use case is big table with lot of indexes. Did you test
> similar case?

Not yet; I will try it, thank you for the suggestion. But how big is a 'big table' and how many is a 'lot of indexes', in approximate numbers? Also, index updates and constraint-check performance are things I cannot control during COPY execution, so I probably should not worry too much about them. But it is of course interesting how COPY performs in that case.

Alexey
[HACKERS] Parallel COPY FROM execution
Greetings pgsql-hackers,

I am a GSoC student this year; my initial proposal has been discussed in the following thread: https://www.postgresql.org/message-id/flat/7179F2FD-49CE-4093-AE14-1B26C5DFB0DA%40gmail.com

The patch with COPY FROM error handling seems to be nearly finished, so I have started thinking about parallelism in COPY FROM, which is the next point in my proposal. To understand whether there are any expensive calls in COPY which can be executed in parallel, I did some research.

First, please find attached a flame graph of the most expensive copy.c calls during 'COPY FROM file' (copy_from.svg). It reveals that the inevitably serial operations like CopyReadLine (<15%) and heap_multi_insert (~15%) take less than 50% of the time in total, while the remaining operations, like heap_form_tuple and the multiple checks inside NextCopyFrom, can probably be executed well in parallel.

Second, I have compared the execution time of 'COPY FROM a single large file (~300 MB, 5000 lines)' vs. 'COPY FROM four equal parts of the original file executed in four parallel processes'. Though it is a very rough test, it helps to obtain an overall estimate:

Serial:
real 0m56.571s
user 0m0.005s
sys  0m0.006s

Parallel (x4):
real 0m22.542s
user 0m0.015s
sys  0m0.018s

Thus, it results in a ~60% performance boost per each x2 multiplication of parallel processes, which is consistent with the initial estimation.

After several discussions I have two possible solutions in mind:

1) Simple solution

Let us focus only on 'COPY FROM file'; then it is relatively easy to implement: just give the same file and a different offset to each worker.

++ Simple; a more reliable solution; probably gives the largest possible performance boost
-- Limited number of use cases. Though 'COPY FROM file' is a frequent case, even when one uses it with psql \copy, what actually happens is a client-side file read with stdin streaming to the backend

2) True parallelism

Implement a pool of bg_workers and a simple shared buffer/queue. While the main COPY process reads the input data and puts raw lines into the queue, the parallel bg_workers take lines from there and process them.

++ A more general solution; supports various COPY FROM use cases
-- A much more sophisticated solution; probably less of a performance boost compared to 1)

I will be glad to receive any comments and criticism.

Alexey
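As a sanity check on the numbers above: the measured 4-way speedup (56.571 s / 22.542 s ≈ 2.5x) matches Amdahl's law for a serial fraction of roughly 20%, comfortably within the <50% bound suggested by the flame graph, and works out to roughly a 60% gain per doubling of workers. A quick back-of-the-envelope sketch:

```python
def amdahl_speedup(serial_fraction, nworkers):
    """Expected speedup when only the parallelizable part scales."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nworkers)

measured = 56.571 / 22.542            # ~2.51x with 4 processes
predicted = amdahl_speedup(0.2, 4)    # serial fraction ~20% -> 2.5x
per_doubling = measured ** 0.5        # ~1.58x per x2 workers, i.e. ~60%
```

The 20% serial fraction here is a fit to the measured numbers, not a measurement itself; it also suggests diminishing returns past 4-8 workers.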
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
Hi pgsql-hackers,

Thank you again for all these replies. I have started working on this project and learnt a lot of new stuff during the last month, so here are some new thoughts about error handling in COPY. I decided to stick to the same thread, since it has a neutral subject.

(1) One of my mentors, Alvaro Herrera, suggested that I have a look at UPSERT. It may be a good point to be able to achieve the same functionality as ON CONFLICT DO NOTHING when COPY actually inserts tuples and error handling is turned on. It could additionally reduce the number of failed subtransactions and thus XID consumption, while still ignoring some common errors like unique index violations.

Adding full support of ON CONFLICT DO NOTHING/UPDATE to COPY seems to be a large separate task, out of the current project's scope, but maybe there is a relatively simple way to internally perform the tuple inserts with ON CONFLICT DO NOTHING? I have added Peter Geoghegan to cc, as I understand he is the major contributor of UPSERT in PostgreSQL. It would be great if he could answer this question.

(2) Otherwise, I am still going to use subtransactions via BeginInternalSubTransaction and PG_TRY / PG_CATCH with ReleaseCurrentSubTransaction / RollbackAndReleaseCurrentSubTransaction. To minimize XID consumption, I will try to insert tuples in batches and pre-validate them as much as possible (as was suggested earlier in the thread).

Alexey
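To illustrate why batching reduces XID consumption, here is a minimal Python simulation of the idea in (2). It is only a sketch of the control flow, not the backend code: is_valid stands in for the real per-tuple checks, and a counter models subtransaction starts. In the error-free case each batch costs a single subtransaction; only a failed batch falls back to per-row subtransactions.

```python
def copy_with_batched_subxacts(rows, is_valid, batch_size=4):
    """Simulate batched subtransactions: one 'subxact' per clean batch,
    falling back to per-row subxacts only when a batch fails."""
    inserted, errors, subxacts = [], [], 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        if all(is_valid(r) for r in batch):
            subxacts += 1             # one subtransaction commits the batch
            inserted.extend(batch)
        else:
            for r in batch:           # batch 'rolled back': retry row by row
                subxacts += 1
                if is_valid(r):
                    inserted.append(r)
                else:
                    errors.append(r)  # report/save to errors table, continue
    return inserted, errors, subxacts
```

With few bad rows, the XID cost approaches one per batch instead of one per row, which is exactly the motivation for pre-validating as much as possible before starting a subtransaction.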
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
Hi Alexander!

I missed your reply, since the proposal submission deadline passed last Monday and I didn't check the hackers mailing list too frequently.

(1) It seems that starting a new subtransaction at step 4 is not necessary. We can just gather all error lines in one pass and, at the end of the input, start only one additional subtransaction with all the safe lines at once: [1, ..., k1 - 1, k1 + 1, ..., k2 - 1, k2 + 1, ...], where ki is an error line number. However, I assume that the only viable use case is when the number of errors is relatively small compared to the total number of rows, because if the input is in a totally inconsistent format, it seems useless to import it into the db at all. Thus, it is not 100% clear to me whether there would be any real difference in performance between starting a new subtransaction at step 4 or not.

(2) Hmm, good question. As far as I know it is impossible to get the size of the stdin input, thus it is impossible to distribute stdin directly to the parallel workers. The first approach that comes to mind is to store the stdin input in some kind of buffer/queue and then let the workers read it in parallel. The question is how this would perform in the case of a large file; I guess poorly, at least from the memory consumption perspective. And whether parallel execution would still be faster is the next question.

Alexey

On Thu, Apr 6, 2017 at 4:47 PM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
> Hi, Alexey!
>
> On Tue, Mar 28, 2017 at 1:54 AM, Alexey Kondratov <kondratov.alek...@gmail.com> wrote:
>
>> Thank you for your responses and valuable comments!
>>
>> I have written a draft proposal:
>> https://docs.google.com/document/d/1Y4mc_PCvRTjLsae-_fhevYfepv4sxaqwhOo4rlxvK1c/edit
>>
>> It seems that COPY currently is able to return the first error line and error
>> type (extra or missing columns, type parse error, etc).
>> Thus, an approach similar to the one Stas wrote about should work and, being
>> optimised for a small number of error rows, should not
>> affect COPY performance in such a case.
>>
>> I will be glad to receive any critical remarks and suggestions.
>
> I have the following questions about your proposal.
>
>> 1. Suppose we have to insert N records
>> 2. We create subtransaction with these N records
>> 3. Error is raised on k-th line
>> 4. Then, we can safely insert all lines from 1st and till (k - 1)
>> 5. Report, save to errors table or silently drop k-th line
>> 6. Next, try to insert lines from (k + 1) till N with another
>> subtransaction
>> 7. Repeat until the end of file
>
> Do you assume that we start a new subtransaction in 4 since the subtransaction
> we started in 2 is rolled back?
>
>> I am planning to use background worker processes for parallel COPY
>> execution. Each process will receive an equal piece of the input file. Since the
>> file is split by size, not by lines, each worker will start the import from
>> the first new line so as not to hit a broken line.
>
> I think the situation when the backend is directly reading a file during COPY
> is not typical. A more typical case is the \copy psql command. In that case
> "COPY ... FROM stdin;" is actually executed while psql is streaming the
> data. How can we apply parallel COPY in this case?
>
> --
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
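The quoted splitting scheme (split the file by size, each worker skipping forward to the first full line) can be sketched in a few lines. This is a hypothetical helper to show the offset arithmetic, not the actual patch code:

```python
import os

def split_offsets(path, nworkers):
    """Split a file into nworkers byte ranges; each range after the first
    starts at the first byte following a newline, so no worker begins
    mid-line (a line crossing a boundary belongs to the previous worker)."""
    size = os.path.getsize(path)
    chunk = size // nworkers
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, nworkers):
            f.seek(i * chunk)
            f.readline()      # skip to the end of the current line
            offsets.append(f.tell())
    # worker i reads [offsets[i], offsets[i+1]) (or EOF for the last one)
    return offsets
```

Note that no line is lost or duplicated: whichever worker's byte range contains a line's terminating newline is the one that processes it. This is also why the scheme cannot work for stdin, where the total size is unknown in advance.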