Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-06-12 Thread Peter Geoghegan
On Mon, Jun 12, 2017 at 3:52 AM, Alexey Kondratov
 wrote:
> I am not going to start with "speculative insertion" right now, but it would
> be very useful if you could give me a pointer on where to start. Maybe I will
> at least try to evaluate the complexity of the problem.

Speculative insertion has the following special entry points to
heapam.c and execIndexing.c, currently only called within
nodeModifyTable.c:

* SpeculativeInsertionLockAcquire()

* HeapTupleHeaderSetSpeculativeToken()

* heap_insert() called with HEAP_INSERT_SPECULATIVE argument

* ExecInsertIndexTuples() with specInsert = true

* heap_finish_speculative()

* heap_abort_speculative()

Offhand, it doesn't seem like it would be that hard to teach another
heap_insert() caller the same tricks.
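
For reference, a rough, hedged sketch of how another caller could use those
entry points, loosely following the ON CONFLICT DO NOTHING branch of
ExecInsert() in nodeModifyTable.c (variable setup, arbiter-index handling and
error paths omitted; signatures are the 9.6/10-era ones, and this is an
illustration, not a finished design):

    uint32        specToken;
    bool          specConflict = false;
    TransactionId xid = GetCurrentTransactionId();

    /* take the speculative-insertion lock and stamp the tuple's token */
    specToken = SpeculativeInsertionLockAcquire(xid);
    HeapTupleHeaderSetSpeculativeToken(tuple->t_data, specToken);

    /* insert into the heap, marked as speculative */
    heap_insert(resultRelationDesc, tuple, estate->es_output_cid,
                HEAP_INSERT_SPECULATIVE, NULL);

    /* noDupErr = true: report a conflict instead of raising an ERROR */
    ExecInsertIndexTuples(slot, &(tuple->t_self), estate,
                          true, &specConflict, NIL);

    if (specConflict)
        heap_abort_speculative(resultRelationDesc, tuple);  /* skip the row */
    else
        heap_finish_speculative(resultRelationDesc, tuple); /* keep it */

    SpeculativeInsertionLockRelease(xid);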

>> My advice right now is: see if you can figure out a way of doing what
>> you want without subtransactions at all, possibly by cutting some
>> scope. For example, maybe it would be satisfactory to have the
>> implementation just ignore constraint violations, but still raise
>> errors for invalid input for types.
>
>
> Initially I was thinking only about malformed rows, e.g. fewer or extra
> columns. Honestly, I did not know that there are so many levels and ways
> where an error can occur.

My sense is that it's going to be hard to sell a committer on any
design that consumes subtransactions in a way that's not fairly
obvious to the user, and doesn't have a pretty easily understood worst
case. But, that's just my opinion, and it's possible that someone else
will disagree. Try to get a second opinion.

Limiting the feature to just skip rows on the basis of a formally
defined constraint failing (not including type input failure, or a
trigger throwing an error, and probably not including foreign key
failures because they're really triggers) might be a good approach.
MySQL's INSERT IGNORE is a bit like that, I think. (It doesn't *just*
ignore duplicate violations, unlike our ON CONFLICT DO NOTHING
feature).

I haven't thought about this very carefully, but I guess you could do
something like passing a flag to ExecConstraints() that indicates
"don't throw an error; instead, just return false so I know not to
proceed". Plus maybe one or two other cases, like using speculative
insertion to back out of a unique violation without consuming a subxact.
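
To make that concrete, a purely hypothetical caller-side sketch (no such flag
exists today -- ExecConstraints() currently returns void and always
ereport()s on failure):

    /* hypothetical extra argument and bool return value */
    if (!ExecConstraints(resultRelInfo, slot, estate, false /* emitError */))
    {
        cstate->skipped_rows++;     /* hypothetical counter in CopyState */
        continue;                   /* skip this row instead of erroring */
    }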

-- 
Peter Geoghegan




Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-06-12 Thread Alexey Kondratov
Thank you for your comments Peter, there are some points that I did not think
about before.

> On 9 Jun 2017, at 01:09, Peter Geoghegan wrote:
>
>> Adding full support of ON CONFLICT DO NOTHING/UPDATE to COPY seems to be a
>> large separate task and is out of the current project scope, but maybe
>> there is a relatively simple way to somehow perform tuple inserts
>> internally with ON CONFLICT DO NOTHING? I have added Peter Geoghegan to cc,
>> as I understand he is the major contributor of UPSERT in PostgreSQL. It
>> would be great if he could answer this question.
>
> I think that there is a way of making COPY use "speculative insertion", so
> that it behaves the same as ON CONFLICT DO NOTHING with no inference
> specification. Whether or not this is useful depends on a lot of things.

I am not going to start with "speculative insertion" right now, but it would
be very useful if you could give me a pointer on where to start. Maybe I will
at least try to evaluate the complexity of the problem.

> I think that you need to more formally identify what errors your new COPY
> error handling will need to swallow.
>
> My advice right now is: see if you can figure out a way of doing what you
> want without subtransactions at all, possibly by cutting some scope. For
> example, maybe it would be satisfactory to have the implementation just
> ignore constraint violations, but still raise errors for invalid input for
> types.

Initially I was thinking only about malformed rows, e.g. fewer or extra
columns. Honestly, I did not know that there are so many levels and ways where
an error can occur. So currently (and especially after your comments) I prefer
to focus only on the following list of errors:

1) File format issues
   a. Fewer columns than needed
   b. Extra columns

2) I have doubts about type mismatch. It is possible to imagine a situation
when, e.g. some integers are exported as int and some as "int", but I am not
sure that it is a common situation.

3) Some constraint violations, e.g. unique index.

The first appears to be easily achievable without subtransactions. I have
created a proof-of-concept version of COPY where the error handling is turned
on by default. Please see the small patch attached (applicable to
76b11e8a43eca4612dfccfe7f3ebd293fb8a46ec) or the GUI version on GitHub
https://github.com/ololobus/postgres/pull/1/files. It throws warnings instead
of errors for malformed lines with fewer/extra columns and reports the line
number.

The second is probably achievable without subtransactions via PG_TRY/PG_CATCH
around heap_form_tuple, since the tuple is not yet inserted into the heap.

But the third is questionable without subtransactions, since even if we check
constraints once, there may be various before/after triggers which can modify
the tuple so that it no longer satisfies them. The corresponding comment
inside copy.c states: "Note that a BR trigger might modify tuple such that the
partition constraint is not satisfied, so we need to check in that case."
Thus, there may be different situations here, as I understand. However, it is
a point where "speculative insertion" is able to help.

These three cases should cover most real-life scenarios.

> Is there really much value in ignoring errors due to invalid encoding?

Now, I have some doubts about it too. If there is an encoding problem, it is
probably about the whole file, not only a few rows.

Alexey

copy-errors-v0.1.patch
Description: Binary data
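
(As a hedged guess at the shape of the first case -- this is in the spirit of
the attached proof of concept, not an excerpt from it -- where NextCopyFrom()
in copy.c currently raises an ERRCODE_BAD_COPY_FILE_FORMAT error for missing
column data, the POC downgrades it roughly like this; the skip label is a
placeholder:

    if (fieldno >= fldct)               /* fewer columns than expected */
    {
        ereport(WARNING,
                (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
                 errmsg("skipping line %d: missing data for column \"%s\"",
                        (int) cstate->cur_lineno,
                        NameStr(attr[m]->attname))));
        goto skip_line;                 /* hypothetical label */
    }
)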


Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-06-08 Thread Peter Geoghegan
On Wed, Jun 7, 2017 at 12:34 PM, Alex K  wrote:
> (1) One of my mentors--Alvaro Herrera--suggested that I have a look at
> UPSERT. It may be a good idea to achieve the same functionality as with
> ON CONFLICT DO NOTHING when COPY actually inserts tuples and error
> handling is turned on. It could additionally reduce the number of failed
> subtransactions and reduce XID consumption, while still ignoring some
> common errors like unique index violations.

Alvaro and I talked about this informally at PGCon.

> Adding full support of ON CONFLICT DO NOTHING/UPDATE to COPY seems
> to be a large separate task and is out of the current project scope, but
> maybe there is a relatively simple way to somehow perform tuple inserts
> internally with ON CONFLICT DO NOTHING? I have added Peter Geoghegan to
> cc, as I understand he is the major contributor of UPSERT in PostgreSQL.
> It would be great if he could answer this question.

I think that there is a way of making COPY use "speculative
insertion", so that it behaves the same as ON CONFLICT DO NOTHING with
no inference specification. Whether or not this is useful depends on a
lot of things.

You seem to be talking about doing this as an optimization on top of a
base feature that does the main thing you want (captures all errors
within an implementation-level subxact without passing them to the
client). That could make sense, as a way of preventing extreme bloat
for a very bad case where almost all inserts have conflicts. (This
seems quite possible, whereas it seems much less likely that users
would have an input file simply full of ill-formed tuples.)

I think that you need to more formally identify what errors your new
COPY error handling will need to swallow. I'm not sure if it's
possible to avoid using subtransactions altogether, but speculative
insertion would help if you find that you can do it without
subtransactions. Using subtransactions is always going to be a bit
ugly, because you'll need to continually reassess whether or not
you're batching insertions together at the right granularity (that is,
that you've weighed the rate of XID consumption against how much work
you lose when a batched transaction has to be "replayed" to include
things that are known to be valid). And, if you care about duplicate
violations, then you can't really be sure that replaying a "known
good" tuple will stay good from one moment to the next.

My advice right now is: see if you can figure out a way of doing what
you want without subtransactions at all, possibly by cutting some
scope. For example, maybe it would be satisfactory to have the
implementation just ignore constraint violations, but still raise
errors for invalid input for types. Is there really much value in
ignoring errors due to invalid encoding? It's not as if such problems
can be reliably detected today. If you use the wrong encoding, and
ignore some errors that COPY would generally raise, then there is an
excellent chance that you'll still insert some remaining rows with
text that has been incorrectly interpreted as valid in the database
encoding -- some text datums are bound to accidentally appear valid.
There are probably similar issues with other types. It's not clear
what the point is at which the user is no longer helped by ignoring
problems, because we cannot reliably detect *all* problems at the
level of each row.

If you must ignore errors within the input functions of types, then
maybe you can optionally let the user do that by way of a "dry run",
where the entire input file is examined for basic structural soundness
ahead of considering constraints. Any errors are saved then and there,
in a format that can be used to make sure that those entries are
skipped on a later COPY. As a further enhancement, in the future, the
user might then be able to define special transform functions that
correct the errors for those rows only. You kind of need to see all
the faulty rows together to do something like that, so a dry run could
make a lot of sense.

-- 
Peter Geoghegan




Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-06-07 Thread Alex K
Hi pgsql-hackers,

Thank you again for all these replies. I have started working on this
project and learnt a lot of new stuff last month, so here are some new
thoughts about error handling in COPY. I decided to stick to the same
thread, since it has a neutral subject.

(1) One of my mentors--Alvaro Herrera--suggested that I have a look at
UPSERT. It may be a good idea to achieve the same functionality as with
ON CONFLICT DO NOTHING when COPY actually inserts tuples and error
handling is turned on. It could additionally reduce the number of failed
subtransactions and reduce XID consumption, while still ignoring some
common errors like unique index violations.

Adding full support of ON CONFLICT DO NOTHING/UPDATE to COPY seems
to be a large separate task and is out of the current project scope, but
maybe there is a relatively simple way to somehow perform tuple inserts
internally with ON CONFLICT DO NOTHING? I have added Peter Geoghegan to
cc, as I understand he is the major contributor of UPSERT in PostgreSQL.
It would be great if he could answer this question.

(2) Otherwise, I am still going to use subtransactions via
BeginInternalSubTransaction and PG_TRY / PG_CATCH with
ReleaseCurrentSubTransaction / RollbackAndReleaseCurrentSubTransaction.
To minimize XID consumption I will try to insert tuples in batches and
pre-validate them as much as possible (as was suggested in the thread
before).
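
For reference, this is roughly the pattern I mean -- a hedged sketch modeled
on how PL/pgSQL runs a block with an EXCEPTION clause (see exec_stmt_block()
in pl_exec.c); insert_one_batch() is a hypothetical placeholder for the
actual per-batch insert code:

    MemoryContext oldcontext = CurrentMemoryContext;
    ResourceOwner oldowner = CurrentResourceOwner;

    BeginInternalSubTransaction(NULL);
    /* BeginInternalSubTransaction switched us to a new memory context */
    MemoryContextSwitchTo(oldcontext);

    PG_TRY();
    {
        insert_one_batch(cstate);           /* hypothetical helper */

        ReleaseCurrentSubTransaction();     /* "commit" the subxact */
        MemoryContextSwitchTo(oldcontext);
        CurrentResourceOwner = oldowner;
    }
    PG_CATCH();
    {
        ErrorData  *edata;

        /* save the error before the subxact state is destroyed */
        MemoryContextSwitchTo(oldcontext);
        edata = CopyErrorData();
        FlushErrorState();

        RollbackAndReleaseCurrentSubTransaction();
        MemoryContextSwitchTo(oldcontext);
        CurrentResourceOwner = oldowner;

        /* report/skip the offending row(s) using edata, then carry on */
        FreeErrorData(edata);
    }
    PG_END_TRY();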



Alexey


Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-04-12 Thread Craig Ringer
On 13 April 2017 at 01:57, Stas Kelvich  wrote:

> However, I think it is worth a quick bit of research whether it is possible
> to create a special code path for COPY in which errors don’t cancel the
> transaction.

Not really. Anything at any layer of the system expects to be able to ERROR:

* datatype input functions
* CHECK constraints
* FK constraints
* unique indexes
* user defined functions run by triggers
* interrupt signalling (think deadlock detector)
* ...

and we rely on ERROR unwinding any relevant memory contexts, releasing
lwlocks, etc.

When an xact aborts it may leave all sorts of mess on disk. Nothing
gets deleted, it's just ignored due to an aborted xmin.

Maybe some xid burn could be saved by trying harder to pre-validate
batches of data as much as possible before we write anything to the
heap, sorting obviously faulty data into buffers and doing as much
work as possible before allocating a new (sub)xid and writing to the
heap. We'd still abort but we'd only be aborting a vtxid.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-04-12 Thread Stas Kelvich

> On 12 Apr 2017, at 20:23, Robert Haas  wrote:
> 
> On Wed, Apr 12, 2017 at 1:18 PM, Nicolas Barbier
>  wrote:
>> 2017-04-11 Robert Haas :
>>> If the data quality is poor (say, 50% of lines have errors) it's
>>> almost impossible to avoid runaway XID consumption.
>> 
>> Yup, that seems difficult to work around with anything similar to the
>> proposed. So the docs might need to suggest not to insert a 300 GB
>> file with 50% erroneous lines :-).
> 
> Yep.  But it does seem reasonably likely that someone might shoot
> themselves in the foot anyway.  Maybe we just live with that.
> 

Moreover, if that file consists of one-byte lines (plus one byte of newline
char) then during its import xid wraparound will happen 18 times =)

I think it’s reasonable at least to have something like a max_errors parameter
to COPY, set by default to 1000 for example. If the user hits that limit then
it is a good moment to emit a warning about possible xid consumption in case
of a bigger limit.
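
Something along these lines, as a hedged sketch (the option, the CopyState
fields and the wording are all hypothetical -- nothing like this exists in
COPY today):

    /* after deciding to skip a bad row */
    if (++cstate->num_errors > cstate->max_errors)   /* hypothetical fields */
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("COPY skipped more than %d bad rows, giving up",
                        cstate->max_errors),
                 errhint("Raising max_errors may consume many transaction IDs.")));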

However, I think it is worth a quick bit of research whether it is possible to
create a special code path for COPY in which errors don’t cancel the
transaction. At least when COPY is called outside of a transaction block.


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-04-12 Thread Robert Haas
On Wed, Apr 12, 2017 at 1:18 PM, Nicolas Barbier
 wrote:
> 2017-04-11 Robert Haas :
>> There's a nasty trade-off here between XID consumption (and the
>> aggressive vacuums it eventually causes) and preserving performance in
>> the face of errors - e.g. if you make k = 100,000 you consume 100x
>> fewer XIDs than if you make k = 1000, but you also have 100x the work
>> to redo (on average) every time you hit an error.
>
> You could make it dynamic: Commit the subtransaction when no error has been
> encountered after N lines (N starts out at 1), then double
> N and continue. When encountering an error, roll back the current
> subtransaction and re-insert all the known good rows that have
> been rolled back (plus maybe the erroneous row into a separate table
> or whatever) in one new subtransaction and commit; then reset N to 1
> and continue processing the rest of the file.
>
> That would work reasonably well whenever the ratio of erroneous rows
> is not extremely high: whether the erroneous rows are all clumped
> together, entirely randomly spread out over the file, or a combination
> of both.

Right.  I wouldn't suggest the exact algorithm you proposed; I think
you ought to vary between some lower limit >1, maybe 10, and some
upper limit, maybe 1,000,000, ratcheting up and down based on how
often you hit errors in some way that might not be as simple as
doubling.  But something along those lines.
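
As a hedged illustration only (the limits, the growth/backoff factors and the
function itself are invented for this sketch, not taken from any patch), the
ratcheting could look like:

    #include <stdbool.h>

    #define MIN_BATCH_SIZE 10
    #define MAX_BATCH_SIZE 1000000

    /* pick the size of the next subtransaction batch */
    static int
    next_batch_size(int current, bool last_batch_failed)
    {
        if (last_batch_failed)
        {
            /* the failed batch had to be partially replayed: back off hard */
            current /= 10;
            if (current < MIN_BATCH_SIZE)
                current = MIN_BATCH_SIZE;
        }
        else
        {
            /* clean batch: grow again to amortize XID consumption */
            current *= 2;
            if (current > MAX_BATCH_SIZE)
                current = MAX_BATCH_SIZE;
        }
        return current;
    }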

>> If the data quality is poor (say, 50% of lines have errors) it's
>> almost impossible to avoid runaway XID consumption.
>
> Yup, that seems difficult to work around with anything similar to the
> proposed. So the docs might need to suggest not to insert a 300 GB
> file with 50% erroneous lines :-).

Yep.  But it does seem reasonably likely that someone might shoot
themselves in the foot anyway.  Maybe we just live with that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-04-12 Thread Nicolas Barbier
2017-04-11 Robert Haas :

> There's a nasty trade-off here between XID consumption (and the
> aggressive vacuums it eventually causes) and preserving performance in
> the face of errors - e.g. if you make k = 100,000 you consume 100x
> fewer XIDs than if you make k = 1000, but you also have 100x the work
> to redo (on average) every time you hit an error.

You could make it dynamic: Commit the subtransaction when no error has been
encountered after N lines (N starts out at 1), then double
N and continue. When encountering an error, roll back the current
subtransaction and re-insert all the known good rows that have
been rolled back (plus maybe the erroneous row into a separate table
or whatever) in one new subtransaction and commit; then reset N to 1
and continue processing the rest of the file.

That would work reasonably well whenever the ratio of erroneous rows
is not extremely high: whether the erroneous rows are all clumped
together, entirely randomly spread out over the file, or a combination
of both.

> If the data quality is poor (say, 50% of lines have errors) it's
> almost impossible to avoid runaway XID consumption.

Yup, that seems difficult to work around with anything similar to the
proposed. So the docs might need to suggest not to insert a 300 GB
file with 50% erroneous lines :-).

Greetings,

Nicolas




Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-04-11 Thread Robert Haas
On Mon, Apr 10, 2017 at 2:46 PM, Alexey Kondratov
 wrote:
> Yes, sure, I don't doubt it. The question was around step 4 in the following 
> possible algorithm:
>
> 1. Suppose we have to insert N records
> 2. Start subtransaction with these N records
> 3. Error is raised on k-th line
> 4. Then, we know that we can safely insert all lines from the 1st till (k - 1)
> 5. Report, save to errors table or silently drop k-th line
> 6. Next, try to insert lines from (k + 1) till Nth with another subtransaction
> 7. Repeat until the end of file
>
> One can start a subtransaction with those (k - 1) safe lines and repeat this
> after each error line

I don't understand what you mean by that.

> OR
> iterate till the end of the file and start only one subtransaction with all
> lines except the error lines.

That could involve buffering a huge file.  Imagine a 300GB load.

Also consider how many XIDs whatever design is proposed will blow
through when loading 300GB of data.  There's a nasty trade-off here
between XID consumption (and the aggressive vacuums it eventually
causes) and preserving performance in the face of errors - e.g. if you
make k = 100,000 you consume 100x fewer XIDs than if you make k =
1000, but you also have 100x the work to redo (on average) every time
you hit an error.  If the data quality is poor (say, 50% of lines have
errors) it's almost impossible to avoid runaway XID consumption.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-04-10 Thread Alexey Kondratov
Yes, sure, I don't doubt it. The question was around step 4 in the following 
possible algorithm:

1. Suppose we have to insert N records
2. Start subtransaction with these N records
3. Error is raised on k-th line
4. Then, we know that we can safely insert all lines from the 1st till (k - 1)
5. Report, save to errors table or silently drop k-th line
6. Next, try to insert lines from (k + 1) till Nth with another subtransaction
7. Repeat until the end of file

One can start a subtransaction with those (k - 1) safe lines and repeat this
after each error line
OR
iterate till the end of the file and start only one subtransaction with all
lines except the error lines.


Alexey


> On 10 Apr 2017, at 19:55, Robert Haas  wrote:
> 
> On Mon, Apr 10, 2017 at 11:39 AM, Alex K  wrote:
>> (1) It seems that starting a new subtransaction at step 4 is not necessary.
>> We can just gather all error lines in one pass and at the end of input start
>> only one additional subtransaction with all safe lines at once: [1, ...,
>> k1 - 1, k1 + 1, ..., k2 - 1, k2 + 1, ...], where ki is an error line number.
> 
> The only way to recover from an error is to abort the subtransaction,
> or to abort the toplevel transaction.  Anything else is unsafe.
> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company





Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-04-10 Thread Robert Haas
On Mon, Apr 10, 2017 at 11:39 AM, Alex K  wrote:
> (1) It seems that starting a new subtransaction at step 4 is not necessary.
> We can just gather all error lines in one pass and at the end of input start
> only one additional subtransaction with all safe lines at once: [1, ...,
> k1 - 1, k1 + 1, ..., k2 - 1, k2 + 1, ...], where ki is an error line number.

The only way to recover from an error is to abort the subtransaction,
or to abort the toplevel transaction.  Anything else is unsafe.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-04-10 Thread Alex K
Hi Alexander!

I missed your reply, since the proposal submission deadline passed last
Monday and I didn't check the hackers mailing list too frequently.

(1) It seems that starting a new subtransaction at step 4 is not necessary.
We can just gather all error lines in one pass and at the end of input
start only one additional subtransaction with all safe lines at once:
[1, ..., k1 - 1, k1 + 1, ..., k2 - 1, k2 + 1, ...], where ki is an error
line number.

But I assume that the only viable use case is when the number of errors is
relatively small compared to the total number of rows, because if the input
is in a totally inconsistent format, then it seems useless to import it into
the db. Thus, it is not 100% clear to me whether there would be any real
difference in performance if one starts a new subtransaction at step 4 or
not.

(2) Hmm, good question. As far as I know it is impossible to get the stdin
input size, thus it is impossible to distribute stdin directly to the
parallel workers. The first approach that comes to mind is to store the
stdin input in some kind of buffer/queue and then read it in parallel by the
workers. The question is how it will perform in the case of a large file; I
guess poorly, at least from the memory consumption perspective. But whether
parallel execution would still be faster is the next question.


Alexey



On Thu, Apr 6, 2017 at 4:47 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> Hi, Alexey!
>
> On Tue, Mar 28, 2017 at 1:54 AM, Alexey Kondratov <
> kondratov.alek...@gmail.com> wrote:
>
>> Thank you for your responses and valuable comments!
>>
>> I have written a draft proposal
>> https://docs.google.com/document/d/1Y4mc_PCvRTjLsae-_fhevYfepv4sxaqwhOo4rlxvK1c/edit
>>
>> It seems that COPY currently is able to return the first error line and
>> error type (extra or missing columns, type parse error, etc).
>> Thus, an approach similar to the one Stas wrote should work and, being
>> optimised for a small number of error rows, should not
>> affect COPY performance in that case.
>>
>> I will be glad to receive any critical remarks and suggestions.
>>
>
> I have the following questions about your proposal.
>
> 1. Suppose we have to insert N records
>> 2. We create subtransaction with these N records
>> 3. Error is raised on k-th line
>> 4. Then, we can safely insert all lines from 1st and till (k - 1)
>>
> 5. Report, save to errors table or silently drop k-th line
>> 6. Next, try to insert lines from (k + 1) till N with another
>> subtransaction
>> 7. Repeat until the end of file
>
>
> Do you assume that we start a new subtransaction in step 4 since the
> subtransaction we started in step 2 is rolled back?
>
>> I am planning to use background worker processes for parallel COPY
>> execution. Each process will receive an equal piece of the input file.
>> Since the file is split by size, not by lines, each worker will start
>> importing from the first newline so as not to hit a broken line.
>
>
> I think that the situation when the backend is directly reading a file
> during COPY is not typical.  The more typical case is the \copy psql
> command.  In that case "COPY ... FROM stdin;" is actually executed while
> psql is streaming the data.
> How can we apply parallel COPY in this case?
>
> --
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>


Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-04-06 Thread Alexander Korotkov
Hi, Alexey!

On Tue, Mar 28, 2017 at 1:54 AM, Alexey Kondratov <
kondratov.alek...@gmail.com> wrote:

> Thank you for your responses and valuable comments!
>
> I have written a draft proposal
> https://docs.google.com/document/d/1Y4mc_PCvRTjLsae-_fhevYfepv4sxaqwhOo4rlxvK1c/edit
>
> It seems that COPY currently is able to return the first error line and
> error type (extra or missing columns, type parse error, etc).
> Thus, an approach similar to the one Stas wrote should work and, being
> optimised for a small number of error rows, should not
> affect COPY performance in that case.
>
> I will be glad to receive any critical remarks and suggestions.
>

I have the following questions about your proposal.

1. Suppose we have to insert N records
> 2. We create subtransaction with these N records
> 3. Error is raised on k-th line
> 4. Then, we can safely insert all lines from 1st and till (k - 1)
>
5. Report, save to errors table or silently drop k-th line
> 6. Next, try to insert lines from (k + 1) till N with another
> subtransaction
> 7. Repeat until the end of file


Do you assume that we start a new subtransaction in step 4 since the
subtransaction we started in step 2 is rolled back?

> I am planning to use background worker processes for parallel COPY
> execution. Each process will receive an equal piece of the input file.
> Since the file is split by size, not by lines, each worker will start
> importing from the first newline so as not to hit a broken line.


I think that the situation when the backend is directly reading a file during
COPY is not typical.  The more typical case is the \copy psql command.  In
that case "COPY ... FROM stdin;" is actually executed while psql is streaming
the data.
How can we apply parallel COPY in this case?
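
(For the file-backed case quoted above -- this does not answer the stdin
question -- a minimal, self-contained sketch of the "split by size, then
advance to the next newline" idea could look like the following; the function
name and the omitted error handling are illustrative only:

    #include <stdio.h>
    #include <sys/stat.h>

    /* Compute the byte range worker `i` of `nworkers` should scan. */
    static void
    worker_chunk(const char *path, int i, int nworkers, long *start, long *end)
    {
        struct stat st;
        long        chunk;

        stat(path, &st);
        chunk = st.st_size / nworkers;
        *start = (long) i * chunk;
        *end = (i == nworkers - 1) ? (long) st.st_size : *start + chunk;

        if (*start > 0)
        {
            /* skip the (possibly partial) line owned by the previous worker */
            FILE *fp = fopen(path, "r");
            int   c;

            fseek(fp, *start, SEEK_SET);
            while ((c = fgetc(fp)) != EOF && c != '\n')
                (*start)++;
            if (c == '\n')
                (*start)++;         /* first byte after the newline */
            fclose(fp);
        }
        /* a worker also finishes the line that crosses its `end` boundary */
    }
)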

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-03-27 Thread Alexey Kondratov
Pavel, Craig and Stas,

Thank you for your responses and valuable comments!

I have written a draft proposal
https://docs.google.com/document/d/1Y4mc_PCvRTjLsae-_fhevYfepv4sxaqwhOo4rlxvK1c/edit
 


It seems that COPY currently is able to return the first error line and error
type (extra or missing columns, type parse error, etc).
Thus, an approach similar to the one Stas wrote should work and, being
optimised for a small number of error rows, should not
affect COPY performance in that case.

I will be glad to receive any critical remarks and suggestions.


Alexey


> On 23 Mar 2017, at 17:24, Stas Kelvich  wrote:
> 
>> 
>> On 23 Mar 2017, at 15:53, Craig Ringer  wrote:
>> 
>> On 23 March 2017 at 19:33, Alexey Kondratov  
>> wrote:
>> 
>>> (1) Add errors handling to COPY as a minimum program
>> 
>> Huge +1 if you can do it in an efficient way.
>> 
>> I think the main barrier to doing so is that the naïve approach
>> creates a subtransaction for every row, which is pretty dire in
>> performance terms and burns transaction IDs very rapidly.
>> 
>> Most of our datatype I/O functions, etc, have no facility for being
>> invoked in a mode where they fail nicely and clean up after
>> themselves. We rely on unwinding the subtransaction's memory context
>> for error handling, for releasing any LWLocks that were taken, etc.
>> There's no try_timestamptz_in function or anything, just
>> timestamptz_in, and it ERROR's if it doesn't like its input. You
>> cannot safely PG_TRY / PG_CATCH such an exception and continue
>> processing to, say, write another row.
>> 
>> Currently we also don't have a way to differentiate between
>> 
>> * "this row is structurally invalid" (wrong number of columns, etc)
>> * "this row is structually valid but has fields we could not parse
>> into their data types"
>> * "this row looks structurally valid and has data types we could
>> parse, but does not satisfy a constraint on the destination table"
>> 
>> Nor do we have a way to write to any kind of failure-log table in the
>> database, since a simple approach relies on aborting subtransactions
>> to clean up failed inserts so it can't write anything for failed rows.
>> Not without starting a 2nd subxact to record the failure, anyway.
> 
> If we are optimising COPY for the case with a small number of bad rows
> then a 2nd subtransaction seems like not a bad idea. We can try to
> apply a batch in a subtx; if it fails on some row N then insert rows [1, N)
> in the next subtx, report an error, commit the subtx. Row N+1 can be treated
> as the beginning of the next batch.
> 
> 
> But if there are some problems with handling everything with
> subtransactions, and since parallelism is mentioned anyway, what about
> starting a bgworker that will perform the data insertion and will be
> controlled by the backend?
> 
> For example the backend can do the following:
> 
> * Start a bgworker (or even a parallel worker)
> * Get a chunk of rows out of the file and try to apply them in a batch
> as a subtransaction in the bgworker.
> * If it fails then we can open a transaction in the backend itself and
> raise a notice / move failed rows to a special errors table.
> 
>> So, having said why it's hard, I don't really have much for you in
>> terms of suggestions for ways forward. User-defined data types,
>> user-defined constraints and triggers, etc mean anything involving
>> significant interface changes will be a hard sell, especially in
>> something pretty performance-sensitive like COPY.
>> 
>> I guess it'd be worth setting out your goals first. Do you want to
>> handle all the kinds of problems above? Malformed  rows, rows with
>> malformed field values, and rows that fail to satisfy a constraint? or
>> just some subset?
>> 
>> 
>> 
>> -- 
>> Craig Ringer   http://www.2ndQuadrant.com/
>> PostgreSQL Development, 24x7 Support, Training & Services
>> 
>> 


Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-03-23 Thread Stas Kelvich

> On 23 Mar 2017, at 15:53, Craig Ringer  wrote:
> 
> On 23 March 2017 at 19:33, Alexey Kondratov  
> wrote:
> 
>> (1) Add errors handling to COPY as a minimum program
> 
> Huge +1 if you can do it in an efficient way.
> 
> I think the main barrier to doing so is that the naïve approach
> creates a subtransaction for every row, which is pretty dire in
> performance terms and burns transaction IDs very rapidly.
> 
> Most of our datatype I/O functions, etc, have no facility for being
> invoked in a mode where they fail nicely and clean up after
> themselves. We rely on unwinding the subtransaction's memory context
> for error handling, for releasing any LWLocks that were taken, etc.
> There's no try_timestamptz_in function or anything, just
> timestamptz_in, and it ERROR's if it doesn't like its input. You
> cannot safely PG_TRY / PG_CATCH such an exception and continue
> processing to, say, write another row.
> 
> Currently we also don't have a way to differentiate between
> 
> * "this row is structurally invalid" (wrong number of columns, etc)
> * "this row is structually valid but has fields we could not parse
> into their data types"
> * "this row looks structurally valid and has data types we could
> parse, but does not satisfy a constraint on the destination table"
> 
> Nor do we have a way to write to any kind of failure-log table in the
> database, since a simple approach relies on aborting subtransactions
> to clean up failed inserts so it can't write anything for failed rows.
> Not without starting a 2nd subxact to record the failure, anyway.

If we are optimising COPY for the case with a small number of bad rows
then a 2nd subtransaction seems like not a bad idea. We can try to
apply a batch in a subtx; if it fails on some row N then insert rows [1, N)
in the next subtx, report an error, commit the subtx. Row N+1 can be treated
as the beginning of the next batch.


But if there are some problems with handling everything with
subtransactions, and since parallelism is mentioned anyway, what about
starting a bgworker that will perform the data insertion and will be
controlled by the backend?

For example the backend can do the following:

* Start a bgworker (or even a parallel worker)
* Get a chunk of rows out of the file and try to apply them in a batch
as a subtransaction in the bgworker.
* If it fails then we can open a transaction in the backend itself and
raise a notice / move failed rows to a special errors table.

> So, having said why it's hard, I don't really have much for you in
> terms of suggestions for ways forward. User-defined data types,
> user-defined constraints and triggers, etc mean anything involving
> significant interface changes will be a hard sell, especially in
> something pretty performance-sensitive like COPY.
> 
> I guess it'd be worth setting out your goals first. Do you want to
> handle all the kinds of problems above? Malformed  rows, rows with
> malformed field values, and rows that fail to satisfy a constraint? or
> just some subset?
> 
> 
> 
> -- 
> Craig Ringer   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
> 
> 





Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-03-23 Thread Craig Ringer
On 23 March 2017 at 19:33, Alexey Kondratov  wrote:

> (1) Add errors handling to COPY as a minimum program

Huge +1 if you can do it in an efficient way.

I think the main barrier to doing so is that the naïve approach
creates a subtransaction for every row, which is pretty dire in
performance terms and burns transaction IDs very rapidly.

Most of our datatype I/O functions, etc, have no facility for being
invoked in a mode where they fail nicely and clean up after
themselves. We rely on unwinding the subtransaction's memory context
for error handling, for releasing any LWLocks that were taken, etc.
There's no try_timestamptz_in function or anything, just
timestamptz_in, and it ERROR's if it doesn't like its input. You
cannot safely PG_TRY / PG_CATCH such an exception and continue
processing to, say, write another row.
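
As a hedged illustration of that point (variable setup omitted), a datatype
input function is invoked roughly like this, e.g. from COPY's per-field loop;
for a malformed value the ereport(ERROR) simply propagates out of the call,
and there is no "try" variant to use instead:

    Datum   value;

    value = InputFunctionCall(&in_function,  /* e.g. FmgrInfo for timestamptz_in */
                              field_string,  /* raw text of the field */
                              typioparam,
                              atttypmod);
    /* for bad input we never get here; the ERROR unwinds the (sub)xact */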

Currently we also don't have a way to differentiate between

* "this row is structurally invalid" (wrong number of columns, etc)
* "this row is structually valid but has fields we could not parse
into their data types"
* "this row looks structurally valid and has data types we could
parse, but does not satisfy a constraint on the destination table"

Nor do we have a way to write to any kind of failure-log table in the
database, since a simple approach relies on aborting subtransactions
to clean up failed inserts so it can't write anything for failed rows.
Not without starting a 2nd subxact to record the failure, anyway.

So, having said why it's hard, I don't really have much for you in
terms of suggestions for ways forward. User-defined data types,
user-defined constraints and triggers, etc mean anything involving
significant interface changes will be a hard sell, especially in
something pretty performance-sensitive like COPY.

I guess it'd be worth setting out your goals first. Do you want to
handle all the kinds of problems above? Malformed  rows, rows with
malformed field values, and rows that fail to satisfy a constraint? or
just some subset?



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-03-23 Thread Pavel Stehule
>> 1) Is there anyone out of the PG community who will be interested in such
>> a project and can be a mentor?
>> 2) These two points have a general idea – to simplify work with a large
>> amount of data from different sources, but maybe it would be better to
>> focus on a single task?
>>
>
> I spent a lot of time on an implementation of (1) - maybe I can find a
> patch somewhere. Both tasks have something in common - you have to divide
> the import into more batches.
>

The patch is in /dev/null :( - My implementation was based on subtransactions
of 1000 rows. When some check failed, then I threw away the subtransaction and
imported every row from the block in its own subtransaction. It was a
prototype - I didn't look for a smarter implementation.

>
>
>
>> 3) Is it realistic to mostly finish both parts during the 3+ months of
>> almost full-time work or am I too presumptuous?
>>
>
> It is possible, I think - I am not sure about all the possible details,
> but a basic implementation can be done in 3 months.
>

Some data and some checks depend on order - that can be a problem in parallel
processing - you should define the corner cases.


>
>
>>
>> I will very much appreciate any comments and criticism.
>>
>>
>> P.S. I know about the very interesting ready-made projects from the PG
>> community https://wiki.postgresql.org/wiki/GSoC_2017, but it is always more
>> interesting to solve your own problems, issues and questions, which are the
>> product of your experience with software. That's why I dare to propose my
>> own project.
>>
>> P.P.S. A few words about me: I'm a PhD student in theoretical physics
>> from Moscow, Russia, and have been highly involved in software development
>> since 2010. I guess that I have good skills in Python, Ruby, JavaScript,
>> MATLAB, C, and Fortran development and a basic understanding of algorithm
>> design and analysis.
>>
>>
>> Best regards,
>>
>> Alexey
>>
>
>


Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-03-23 Thread Pavel Stehule
Hi

2017-03-23 12:33 GMT+01:00 Alexey Kondratov :

> Hi pgsql-hackers,
>
> I'm planning to apply to GSOC'17 and my proposal consists currently of two
> parts:
>
> (1) Add errors handling to COPY as a minimum program
>
> Motivation: Using PG on a daily basis for years, I found that there are
> some cases when you need to load (e.g. for further analytics) a bunch of
> not-very-consistent records with rare type/column mismatches. Since PG
> throws an exception on the first error, currently the only solution is to
> preformat your data with some other tool and then load it into PG. However,
> frequently it is easier to drop certain records instead of doing such
> preprocessing for every data source you have.
>
> I have done a bit of research and found the item in PG's TODO
> https://wiki.postgresql.org/wiki/Todo#COPY and a previous attempt to push a
> similar patch
> https://www.postgresql.org/message-id/flat/603c8f070909141218i291bc983t501507ebc996a531%40mail.gmail.com#603c8f070909141218i291bc983t501507ebc996a...@mail.gmail.com.
> There were no negative responses to this patch and it seems that it was
> just forgotten and has not been finalized.
>
> As an example of the general idea I can point to the *read_csv* method of
> the python package *pandas*
> (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).
> It uses a C parser which throws an error on the first column mismatch.
> However, it has two flags, *error_bad_lines* and *warn_bad_lines*, which,
> being set to False, help to drop bad lines or even hide warning messages
> about them.
>
>
> (2) Parallel COPY execution as a maximum program
>
> I guess that there is nothing much to say about motivation; it just
> should be faster on multicore CPUs.
>
> There is also a record about parallel COPY in PG's wiki
> https://wiki.postgresql.org/wiki/Parallel_Query_Execution. There are some
> third-party extensions, e.g. https://github.com/ossc-db/pg_bulkload, but it
> is always better to have well-performing core functionality out of the box.
>
>
> My main concerns here are:
>
> 1) Is there anyone out of the PG community who will be interested in such
> a project and can be a mentor?
> 2) These two points have a general idea – to simplify work with a large
> amount of data from different sources, but maybe it would be better to
> focus on a single task?
>

I spent a lot of time on an implementation of (1) - maybe I can find a patch
somewhere. Both tasks have something in common - you have to divide the import
into more batches.



> 3) Is it realistic to mostly finish both parts during the 3+ months of
> almost full-time work or am I too presumptuous?
>

It is possible, I think - I am not sure about all the possible details,
but a basic implementation can be done in 3 months.


>
> I will very much appreciate any comments and criticism.
>
>
> P.S. I know about the very interesting ready-made projects from the PG
> community https://wiki.postgresql.org/wiki/GSoC_2017, but it is always more
> interesting to solve your own problems, issues and questions, which are the
> product of your experience with software. That's why I dare to propose my
> own project.
>
> P.P.S. A few words about me: I'm a PhD student in theoretical physics from
> Moscow, Russia, and have been highly involved in software development since
> 2010. I guess that I have good skills in Python, Ruby, JavaScript, MATLAB,
> C, and Fortran development and a basic understanding of algorithm design
> and analysis.
>
>
> Best regards,
>
> Alexey
>


[HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-03-23 Thread Alexey Kondratov
Hi pgsql-hackers,

I'm planning to apply to GSOC'17 and my proposal consists currently of two 
parts:

(1) Add errors handling to COPY as a minimum program

Motivation: Using PG on a daily basis for years, I found that there are some
cases when you need to load (e.g. for further analytics) a bunch of
not-very-consistent records with rare type/column mismatches. Since PG throws
an exception on the first error, currently the only solution is to preformat
your data with some other tool and then load it into PG. However, frequently
it is easier to drop certain records instead of doing such preprocessing for
every data source you have.

I have done a bit of research and found the item in PG's TODO
https://wiki.postgresql.org/wiki/Todo#COPY and a previous attempt to push a
similar patch
https://www.postgresql.org/message-id/flat/603c8f070909141218i291bc983t501507ebc996a531%40mail.gmail.com#603c8f070909141218i291bc983t501507ebc996a...@mail.gmail.com.
There were no negative responses to this patch and it seems that it was just
forgotten and has not been finalized.

As an example of the general idea I can point to the read_csv method of the
python package pandas
(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).
It uses a C parser which throws an error on the first column mismatch.
However, it has two flags, error_bad_lines and warn_bad_lines, which, being
set to False, help to drop bad lines or even hide warning messages about them.


(2) Parallel COPY execution as a maximum program

I guess that there is nothing much to say about motivation; it just should
be faster on multicore CPUs.

There is also a record about parallel COPY in PG's wiki
https://wiki.postgresql.org/wiki/Parallel_Query_Execution. There are some
third-party extensions, e.g. https://github.com/ossc-db/pg_bulkload, but it is
always better to have well-performing core functionality out of the box.


My main concerns here are:

1) Is there anyone out of the PG community who will be interested in such a
project and can be a mentor?
2) These two points have a general idea – to simplify work with a large amount
of data from different sources, but maybe it would be better to focus on a
single task?
3) Is it realistic to mostly finish both parts during the 3+ months of almost
full-time work or am I too presumptuous?

I will very much appreciate any comments and criticism.


P.S. I know about the very interesting ready-made projects from the PG
community https://wiki.postgresql.org/wiki/GSoC_2017, but it is always more
interesting to solve your own problems, issues and questions, which are the
product of your experience with software. That's why I dare to propose my own
project.

P.P.S. A few words about me: I'm a PhD student in theoretical physics from
Moscow, Russia, and have been highly involved in software development since
2010. I guess that I have good skills in Python, Ruby, JavaScript, MATLAB, C,
and Fortran development and a basic understanding of algorithm design and
analysis.


Best regards,

Alexey