Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-06-12 Thread Alexey Kondratov
Thank you for your comments, Peter, there are some points that I did not think about before.

> On 9 Jun 2017, at 01:09, Peter Geoghegan  wrote:
> 
>> Adding a full support of ON CONFLICT DO NOTHING/UPDATE to COPY seems
>> to be a large separate task and is out of the current project scope, but
>> maybe there is a relatively simple way to somehow perform tuple inserts
>> internally with ON CONFLICT DO NOTHING? I have added Peter Geoghegan to cc, as
>> I understand he is the major contributor of UPSERT in PostgreSQL. It would
>> be great if he answered this question.
> 
> I think that there is a way of making COPY use "speculative
> insertion", so that it behaves the same as ON CONFLICT DO NOTHING with
> no inference specification. Whether or not this is useful depends on a
> lot of things.

I am not going to start with "speculative insertion" right now, but it would be very useful if you could give me a pointer on where to start. Maybe I will at least try to evaluate the complexity of the problem.

> I think that you need to more formally identify what errors your new
> COPY error handling will need to swallow.
> 
> My advice right now is: see if you can figure out a way of doing what
> you want without subtransactions at all, possibly by cutting some
> scope. For example, maybe it would be satisfactory to have the
> implementation just ignore constraint violations, but still raise
> errors for invalid input for types.

Initially I was thinking only about malformed rows, e.g. rows with fewer or extra columns. Honestly, I did not know that there are so many levels and ways in which an error can occur. So currently (and especially after your comments) I prefer to focus only on the following list of errors:

1) File format issues
	a. Fewer columns than needed
	b. Extra columns
2) Type mismatches. I have doubts about this one: it is possible to imagine a situation when, e.g., some integers are exported as int and some as "int", but I am not sure that it is a common situation.
3) Some constraint violations, e.g. a unique index.

The first appears to be easily achievable without subtransactions. I have created a proof-of-concept version of COPY where this error handling is turned on by default. Please see the small patch attached (applicable to 76b11e8a43eca4612dfccfe7f3ebd293fb8a46ec), or view the diff in the GitHub UI: https://github.com/ololobus/postgres/pull/1/files. It throws warnings instead of errors for malformed lines with fewer/extra columns and reports the line number.

The second is probably achievable without subtransactions via PG_TRY/PG_CATCH around heap_form_tuple, since at that point the tuple has not yet been inserted into the heap (a rough sketch of what I mean is at the end of this message).

But the third is questionable without subtransactions: even if we check the constraints once, there may be various before/after triggers which can modify the tuple so that it no longer satisfies them. The corresponding comment inside copy.c states: "Note that a BR trigger might modify tuple such that the partition constraint is no satisfied, so we need to check in that case." Thus there may be different situations here, as I understand it. However, this is a point where "speculative insertion" would be able to help.

These three cases should cover most real-life scenarios.

> Is there really much value in ignoring errors due to invalid encoding?

Now I have some doubts about it too. If there is an encoding problem, it probably affects the whole file, not only a few rows.

Alexey
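P.S. Below is a purely illustrative sketch of the PG_TRY/PG_CATCH idea for the second case. The variables (tuple, tupDesc, values, nulls, lineno, skip_row) are placeholders rather than actual copy.c code, and, as Craig pointed out earlier in the thread, catching an arbitrary error without a subtransaction is not safe in general, so this only shows the shape of the idea:

MemoryContext oldcontext = CurrentMemoryContext;

PG_TRY();
{
	/* Form the tuple from the already-parsed column values. */
	tuple = heap_form_tuple(tupDesc, values, nulls);
}
PG_CATCH();
{
	ErrorData  *edata;

	/* Copy the error out of the error context, then clear it. */
	MemoryContextSwitchTo(oldcontext);
	edata = CopyErrorData();
	FlushErrorState();

	/* Warn instead of aborting the whole COPY, and remember to skip the row. */
	ereport(WARNING,
			(errmsg("skipping line %d: %s", lineno, edata->message)));
	FreeErrorData(edata);
	skip_row = true;
}
PG_END_TRY();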

copy-errors-v0.1.patch
Description: Binary data


Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-04-10 Thread Alexey Kondratov
Yes, sure, I don't doubt it. The question was around step 4 in the following 
possible algorithm:

1. Suppose we have to insert N records
2. Start a subtransaction with these N records
3. An error is raised on the k-th line
4. Then we know that we can safely insert all lines from the 1st to the (k - 1)th
5. Report, save to an errors table, or silently drop the k-th line
6. Next, try to insert lines from the (k + 1)th to the Nth with another subtransaction
7. Repeat until the end of file

One can either start a subtransaction with those (k - 1) safe lines and repeat this 
after each error line
OR 
iterate till the end of the file and start only one subtransaction with all lines 
except the error lines. A rough sketch of the first (batch-retry) variant is below.
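
Purely as an illustration of the first (batch-retry) variant, a rough C-style sketch could look like this. Here read_next_line(), insert_line(), reinsert_safe_lines() and BATCH_SIZE are invented placeholders, not existing PostgreSQL functions; only the subtransaction and error-state calls (BeginInternalSubTransaction, ReleaseCurrentSubTransaction, RollbackAndReleaseCurrentSubTransaction, CopyErrorData, FlushErrorState) are real APIs, used e.g. by PL/pgSQL exception blocks:

bool	done = false;

while (!done)
{
	MemoryContext	oldcontext = CurrentMemoryContext;
	ResourceOwner	oldowner = CurrentResourceOwner;
	int				n = 0;

	BeginInternalSubTransaction(NULL);

	PG_TRY();
	{
		/* Try to apply a whole batch of lines inside one subtransaction. */
		while (n < BATCH_SIZE && read_next_line())
		{
			insert_line();			/* may ERROR on a malformed line */
			n++;
		}
		done = (n < BATCH_SIZE);	/* ran out of input */

		/* The whole batch succeeded: make it permanent. */
		ReleaseCurrentSubTransaction();
		MemoryContextSwitchTo(oldcontext);
		CurrentResourceOwner = oldowner;
	}
	PG_CATCH();
	{
		ErrorData  *edata;

		/* The k-th line failed: save the error, discard the batch. */
		MemoryContextSwitchTo(oldcontext);
		edata = CopyErrorData();
		FlushErrorState();
		RollbackAndReleaseCurrentSubTransaction();
		MemoryContextSwitchTo(oldcontext);
		CurrentResourceOwner = oldowner;

		/* Step 5: report / log the bad line. */
		ereport(WARNING, (errmsg("skipping bad line: %s", edata->message)));
		FreeErrorData(edata);

		/*
		 * Steps 4 and 6: lines 1 .. k-1 of this batch are known to be safe,
		 * so re-insert them in a fresh subtransaction, then continue from
		 * line k+1.
		 */
		reinsert_safe_lines();
	}
	PG_END_TRY();
}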


Alexey


> On 10 Apr 2017, at 19:55, Robert Haas  wrote:
> 
> On Mon, Apr 10, 2017 at 11:39 AM, Alex K  wrote:
>> (1) It seems that starting a new subtransaction at step 4 is not necessary. We
>> can just gather all error lines in one pass and, at the end of input, start
>> only one additional subtransaction with all safe lines at once: [1, ...,
>> k1 - 1, k1 + 1, ..., k2 - 1, k2 + 1, ...], where ki is an error line number.
> 
> The only way to recover from an error is to abort the subtransaction,
> or to abort the toplevel transaction.  Anything else is unsafe.
> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company





Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-03-27 Thread Alexey Kondratov
Pavel, Craig and Stas,

Thank you for your responses and valuable comments!

I have written a draft proposal:
https://docs.google.com/document/d/1Y4mc_PCvRTjLsae-_fhevYfepv4sxaqwhOo4rlxvK1c/edit

It seems that COPY is currently able to return the first error line and the error type 
(extra or missing columns, type parse error, etc.).
Thus, an approach similar to the one Stas described should work and, being optimised 
for a small number of error rows, should not affect COPY performance in that case.

I will be glad to receive any critical remarks and suggestions.


Alexey


> On 23 Mar 2017, at 17:24, Stas Kelvich <stas.kelv...@gmail.com> wrote:
> 
>> 
>> On 23 Mar 2017, at 15:53, Craig Ringer <cr...@2ndquadrant.com> wrote:
>> 
>> On 23 March 2017 at 19:33, Alexey Kondratov <kondratov.alek...@gmail.com> 
>> wrote:
>> 
>>> (1) Add errors handling to COPY as a minimum program
>> 
>> Huge +1 if you can do it in an efficient way.
>> 
>> I think the main barrier to doing so is that the naïve approach
>> creates a subtransaction for every row, which is pretty dire in
>> performance terms and burns transaction IDs very rapidly.
>> 
>> Most of our datatype I/O functions, etc, have no facility for being
>> invoked in a mode where they fail nicely and clean up after
>> themselves. We rely on unwinding the subtransaction's memory context
>> for error handling, for releasing any LWLocks that were taken, etc.
>> There's no try_timestamptz_in function or anything, just
>> timestamptz_in, and it ERROR's if it doesn't like its input. You
>> cannot safely PG_TRY / PG_CATCH such an exception and continue
>> processing to, say, write another row.
>> 
>> Currently we also don't have a way to differentiate between
>> 
>> * "this row is structurally invalid" (wrong number of columns, etc)
>> * "this row is structually valid but has fields we could not parse
>> into their data types"
>> * "this row looks structurally valid and has data types we could
>> parse, but does not satisfy a constraint on the destination table"
>> 
>> Nor do we have a way to write to any kind of failure-log table in the
>> database, since a simple approach relies on aborting subtransactions
>> to clean up failed inserts so it can't write anything for failed rows.
>> Not without starting a 2nd subxact to record the failure, anyway.
> 
> If we are optimising COPY for the case with a small number of bad rows,
> then a 2nd subtransaction seems like not a bad idea. We can try to
> apply a batch in a subtx; if it fails on some row N, then insert rows [1, N)
> in the next subtx, report an error, and commit that subtx. Row N+1 can be
> treated as the beginning of the next batch.
> 
> 
> But if there are problems with handling everything with
> subtransactions, and since parallelism is mentioned anyway, what about
> starting a bgworker that will perform the data insertion and will be
> controlled by the backend?
> 
> For example, the backend can do the following:
> 
> * Start a bgworker (or even a parallel worker).
> * Get a chunk of rows out of the file and try to apply them in a batch
> as a subtransaction in the bgworker.
> * If it fails, then we can open a transaction in the backend itself and
> raise a notice / move the failed rows to a special errors table.
> 
>> So, having said why it's hard, I don't really have much for you in
>> terms of suggestions for ways forward. User-defined data types,
>> user-defined constraints and triggers, etc mean anything involving
>> significant interface changes will be a hard sell, especially in
>> something pretty performance-sensitive like COPY.
>> 
>> I guess it'd be worth setting out your goals first. Do you want to
>> handle all the kinds of problems above? Malformed  rows, rows with
>> malformed field values, and rows that fail to satisfy a constraint? or
>> just some subset?
>> 
>> 
>> 
>> -- 
>> Craig Ringer   http://www.2ndQuadrant.com/
>> PostgreSQL Development, 24x7 Support, Training & Services
>> 
>> 


[HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-03-23 Thread Alexey Kondratov
Hi pgsql-hackers,

I'm planning to apply to GSOC'17 and my proposal consists currently of two 
parts:

(1) Add errors handling to COPY as a minimum program

Motivation: Using PG on a daily basis for years, I have found that there are some 
cases when you need to load (e.g. for further analytics) a bunch of not entirely 
consistent records with occasional type/column mismatches. Since PG throws an 
exception on the first error, currently the only solution is to preformat your data 
with some other tool and then load it into PG. However, it is frequently easier to 
drop certain records than to do such preprocessing for every data source you have.

I have done some research and found the item in PG's TODO 
https://wiki.postgresql.org/wiki/Todo#COPY and a previous attempt to push a similar 
patch 
https://www.postgresql.org/message-id/flat/603c8f070909141218i291bc983t501507ebc996a531%40mail.gmail.com#603c8f070909141218i291bc983t501507ebc996a...@mail.gmail.com.
 There were no negative responses to that patch and it seems that it was simply 
forgotten and never finalized.

As an example of the general idea I can point to the read_csv method of the Python 
package pandas 
(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). 
It uses a C parser which throws an error on the first column mismatch. However, it has 
two flags, error_bad_lines and warn_bad_lines, which, when set to False, make it drop 
bad lines or even suppress the warning messages about them.


(2) Parallel COPY execution as a maximum program

I guess there is not much to say about motivation: it simply should be faster on 
multicore CPUs.

There is also a record about parallel COPY in PG's wiki 
https://wiki.postgresql.org/wiki/Parallel_Query_Execution. There are some third-party 
extensions, e.g. https://github.com/ossc-db/pg_bulkload, but it is always better 
to have well-performing core functionality out of the box.


My main concerns here are:

1) Is there anyone in the PG community who would be interested in such a project 
and could be a mentor?
2) These two points share a general idea, namely to simplify work with large amounts 
of data from different sources, but maybe it would be better to focus on a 
single task?
3) Is it realistic to mostly finish both parts during the 3+ months of almost 
full-time work, or am I being too presumptuous?

I would very much appreciate any comments and criticism.


P.S. I know about the very interesting ready-made projects from the PG community 
(https://wiki.postgresql.org/wiki/GSoC_2017), but it is always more interesting to 
solve your own problems, issues and questions, which are the product of your 
experience with the software. That's why I dare to propose my own project.

P.P.S. A few words about me: I'm a PhD student in theoretical physics from 
Moscow, Russia, and I have been heavily involved in software development since 2010. 
I believe I have good skills in Python, Ruby, JavaScript, MATLAB, C, and Fortran 
development and a basic understanding of algorithm design and analysis.


Best regards,

Alexey