Re: [HACKERS] PL/Python adding support for multi-dimensional arrays

2016-08-03 Thread Alexey Grishchenko
On Wed, Aug 3, 2016 at 12:49 PM, Alexey Grishchenko <agrishche...@pivotal.io
> wrote:

> Hi
>
> The current implementation of PL/Python does not allow the use of
> multi-dimensional arrays, either as input or as output parameters. This
> forces end users to introduce workarounds like casting arrays to text
> before passing them to the function and parsing them back afterwards,
> which is an error-prone approach.
>
> This patch adds support for multi-dimensional arrays as both input and
> output parameters for PL/Python functions. The number of dimensions
> supported is limited by the Postgres MAXDIM macro, equal to 6 by default.
> Both input and output multi-dimensional arrays must have fixed dimension
> sizes, i.e. a 2-d array should represent an MxN matrix, a 3-d array an
> MxNxK cube, etc.
>
> This patch does not support multi-dimensional arrays of composite types,
> as composite types in Python might be represented as iterators and there
> is no obvious way to find out where the nested array stops and the
> composite type structure starts. For example, if we have a composite type
> of (int, text), we can try to return "[ [ [1,'a'], [2,'b'] ], [ [3,'c'],
> [4,'d'] ] ]", and it is hard to tell that the first two levels of lists
> are array dimensions while the third one represents the composite
> structure. Things get even more complex when you have arrays as members
> of a composite type. This is why I think this limitation is reasonable.
>
> Given the function:
>
> CREATE FUNCTION test_type_conversion_array_int4(x int4[]) RETURNS int4[]
> AS $$
> plpy.info(x, type(x))
> return x
> $$ LANGUAGE plpythonu;
>
> Before patch:
>
> # SELECT * FROM test_type_conversion_array_int4(ARRAY[[1,2,3],[4,5,6]]);
> ERROR:  cannot convert multidimensional array to Python list
> DETAIL:  PL/Python only supports one-dimensional arrays.
> CONTEXT:  PL/Python function "test_type_conversion_array_int4"
>
>
> After patch:
>
> # SELECT * FROM test_type_conversion_array_int4(ARRAY[[1,2,3],[4,5,6]]);
> INFO:  ([[1, 2, 3], [4, 5, 6]], <type 'list'>)
>  test_type_conversion_array_int4
> ---------------------------------
>  {{1,2,3},{4,5,6}}
> (1 row)
>
>
> --
> Best regards,
> Alexey Grishchenko
>

Also, this patch incorporates the fix for
https://www.postgresql.org/message-id/CAH38_tkwA5qgLV8zPN1OpPzhtkNKQb30n3xq-2NR9jUfv3qwHA%40mail.gmail.com,
as they touch the same piece of code: array manipulation in PL/Python.

-- 
Best regards,
Alexey Grishchenko


[HACKERS] PL/Python adding support for multi-dimensional arrays

2016-08-03 Thread Alexey Grishchenko
Hi

The current implementation of PL/Python does not allow the use of
multi-dimensional arrays, either as input or as output parameters. This
forces end users to introduce workarounds like casting arrays to text
before passing them to the function and parsing them back afterwards,
which is an error-prone approach.

This patch adds support for multi-dimensional arrays as both input and
output parameters for PL/Python functions. The number of dimensions
supported is limited by the Postgres MAXDIM macro, equal to 6 by default.
Both input and output multi-dimensional arrays must have fixed dimension
sizes, i.e. a 2-d array should represent an MxN matrix, a 3-d array an
MxNxK cube, etc.
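
To illustrate the approach, here is a minimal C sketch of how the
dimensions of a nested Python list can be inferred and checked for the
fixed-size requirement. This is only an illustration using the Python C
API - the function names and structure are made up, not taken from the
patch:

#include <Python.h>
#include <stdbool.h>

/* Collect the size of each nesting level by descending through the
 * first element, refusing more than maxdim levels (Postgres limits
 * arrays to MAXDIM dimensions, 6 by default).  Empty lists are left
 * out for brevity. */
static int
infer_list_dims(PyObject *obj, int *dims, int maxdim)
{
    int ndim = 0;

    while (PyList_Check(obj))
    {
        if (ndim >= maxdim || PyList_Size(obj) == 0)
            return -1;
        dims[ndim++] = (int) PyList_Size(obj);
        obj = PyList_GetItem(obj, 0);   /* borrowed reference */
    }
    return ndim;
}

/* Verify the value is rectangular: every sublist at each level must
 * have exactly the length recorded in dims, since both input and
 * output arrays must have fixed dimension sizes. */
static bool
check_list_dims(PyObject *obj, const int *dims, int ndim, int level)
{
    Py_ssize_t i;

    if (level == ndim)
        return !PyList_Check(obj);  /* only scalars below the last level */
    if (!PyList_Check(obj) || PyList_Size(obj) != dims[level])
        return false;
    for (i = 0; i < dims[level]; i++)
        if (!check_list_dims(PyList_GetItem(obj, i), dims, ndim, level + 1))
            return false;
    return true;
}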

This patch does not support multi-dimensional arrays of composite types,
as composite types in Python might be represented as iterators and there
is no obvious way to find out where the nested array stops and the
composite type structure starts. For example, if we have a composite type
of (int, text), we can try to return "[ [ [1,'a'], [2,'b'] ], [ [3,'c'],
[4,'d'] ] ]", and it is hard to tell that the first two levels of lists
are array dimensions while the third one represents the composite
structure. Things get even more complex when you have arrays as members
of a composite type. This is why I think this limitation is reasonable.

Given the function:

CREATE FUNCTION test_type_conversion_array_int4(x int4[]) RETURNS int4[] AS
$$
plpy.info(x, type(x))
return x
$$ LANGUAGE plpythonu;

Before patch:

# SELECT * FROM test_type_conversion_array_int4(ARRAY[[1,2,3],[4,5,6]]);
ERROR:  cannot convert multidimensional array to Python list
DETAIL:  PL/Python only supports one-dimensional arrays.
CONTEXT:  PL/Python function "test_type_conversion_array_int4"


After patch:

# SELECT * FROM test_type_conversion_array_int4(ARRAY[[1,2,3],[4,5,6]]);
INFO:  ([[1, 2, 3], [4, 5, 6]], <type 'list'>)
 test_type_conversion_array_int4
---------------------------------
 {{1,2,3},{4,5,6}}
(1 row)


-- 
Best regards,
Alexey Grishchenko


0001-PL-Python-adding-support-for-multi-dimensional-arrays.patch
Description: Binary data



[HACKERS] Fix for PL/Python slow input arrays traversal issue

2016-07-28 Thread Alexey Grishchenko
Hi

The following issue exists in PL/Python: when your function takes an array
as an input parameter, processing an array of fixed-size elements that
contains null values is many times slower than processing the same array
without nulls. Here is an example:

-- Function

create or replace function test(a int8[]) returns int8 as $BODY$
return sum([x for x in a if x is not None])
$BODY$ language plpythonu volatile;

pl_regression=# select test(array_agg(a)::int8[])
pl_regression-# from (
pl_regression(# select generate_series(1,10) as a
pl_regression(# ) as q;
 test
------
   55
(1 row)

Time: 22.248 ms
pl_regression=# select test(array_agg(a)::int8[])
pl_regression-# from (
pl_regression(# select generate_series(1,10) as a
pl_regression(# union all
pl_regression(# select null::int8 as a
pl_regression(# ) as q;
 test
------
   55
(1 row)

Time: 7179.921 ms


As you can see, a single null in the array introduces a 320x slowdown. The
reason for this is the following: the original implementation uses
array_ref for each element of the array. Each call to array_ref causes a
subsequent call to array_seek. The function array_seek in turn has a
shortcut for fixed-size arrays with no nulls. But if your array's elements
are not of fixed size, or if the array contains nulls, each call to
array_seek recalculates the offset of the Kth element starting from the
first element. This is an O(N^2) algorithm, resulting in high processing
times for arrays of non-fixed-size elements and arrays with nulls.

The fix I propose applies the same logic used by the array_out function
for efficient array traversal: it keeps a pointer to the offset of the
last fetched element, which results in a dramatic performance improvement
for the affected cases. With this implementation, arrays of fixed-size
elements without nulls, of fixed-size elements with nulls, and of
variable-size elements are all processed at the same speed. Here is the
same test after the fix is applied:

pl_regression=# select test(array_agg(a)::int8[])
pl_regression-# from (
pl_regression(# select generate_series(1,10) as a
pl_regression(# ) as q;
 test
------
   55
(1 row)

Time: 21.056 ms
pl_regression=# select test(array_agg(a)::int8[])
pl_regression-# from (
pl_regression(# select generate_series(1,10) as a
pl_regression(# union all
pl_regression(# select null::int8 as a
pl_regression(# ) as q;
 test
------
   55
(1 row)

Time: 22.839 ms
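
For reference, the spirit of the fix - remembering the position of the
last fetched element instead of re-seeking from the start of the array -
can be sketched with the backend's ArrayIterator API. This is just an
illustration of linear traversal, not the code from the patch:

#include "postgres.h"
#include "utils/array.h"

/* Visit every element of "arr" in O(N) total time.  The iterator keeps
 * the offset of the last fetched element, so it never rescans the
 * array, even when it contains nulls or variable-width elements. */
static void
walk_array_linear(ArrayType *arr)
{
    ArrayIterator iter = array_create_iterator(arr, 0, NULL);
    Datum         value;
    bool          isnull;

    while (array_iterate(iter, &value, &isnull))
    {
        if (isnull)
            continue;           /* a SQL NULL element */
        /* ... convert "value" to a Python object here ... */
    }
    array_free_iterator(iter);
}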


-- 
Best regards,
Alexey Grishchenko


0001-Fix-for-PL-Python-slow-input-arrays-traversal-issue.patch
Description: Binary data



Re: [HACKERS] Endless loop calling PL/Python set returning functions

2016-03-22 Thread Alexey Grishchenko
Alexey Grishchenko <agrishche...@pivotal.io> wrote:

> Alexey Grishchenko <agrishche...@pivotal.io> wrote:
>
>> Tom Lane <t...@sss.pgh.pa.us> wrote:
>>
>>> Alexey Grishchenko <agrishche...@pivotal.io> writes:
>>> > No, my fix handles this well.
>>> > In fact, with the first function call you allocate global variables
>>> > representing Python function input parameters, call the function and
>>> > receive iterator over the function results. Then in a series of
>>> > Postgres calls to PL/Python handler you just fetch next value from
>>> > the iterator, you are not calling the Python function anymore. When
>>> > the iterator reaches the end, PL/Python call handler deallocates the
>>> > global variable representing function input parameter.
>>>
>>> > Regardless of the number of parallel invocations of the same
>>> > function, each of them in my patch would set its own input
>>> > parameters to the Python function, call the function and receive
>>> > separate iterators. When the first function's result iterator would
>>> > reach its end, it would deallocate the input global variable. But it
>>> > won't affect other functions as they no longer need to invoke any
>>> > Python code.
>>>
>>> Well, if you think that works, why not undo the global-dictionary
>>> changes at the end of the first call, rather than later?  Then there's
>>> certainly no overlap in their lifespan.
>>>
>>> regards, tom lane
>>>
>>
>> Could you elaborate more on this? In general, a stack-like solution
>> would work - if before the function call there is a global variable
>> whose name matches an input variable name, push its value onto a stack,
>> and pop it after the function execution. I will implement it tomorrow
>> and see how it works.
>>
>>
>> --
>>
>> Sent from handheld device
>>
>
> I have improved the code using the proposed approach. The second version
> of the patch is attached.
>
> It works in the following way - the procedure object PLyProcedure stores
> information about the call stack depth (calldepth field) and the stack
> itself (argstack field). When the call stack depth is zero we don't do
> any additional processing, i.e. there won't be any performance impact
> for existing end-user functions. Stack manipulations come into play only
> when the calldepth is greater than zero, which happens either when the
> function is called recursively via SPI, or when the same set-returning
> function is called twice or more in a single query.
>
> Example of multiple calls to an SRF within a single query:
>
> CREATE OR REPLACE FUNCTION func(iter int) RETURNS SETOF int AS $$
> return xrange(iter)
> $$ LANGUAGE plpythonu;
>
> select func(3), func(4);
>
>
> Before the patch, this query caused an endless loop finishing with OOM.
> Now it works as it should.
>
> Example of recursion with SPI:
>
> CREATE OR REPLACE FUNCTION test(a int) RETURNS int AS $BODY$
> r = 0
> if a > 1:
>     r = plpy.execute("SELECT test(%d) as a" % (a-1))[0]['a']
> return a + r
> $BODY$ LANGUAGE plpythonu;
>
> select test(10);
>
>
> Before the patch, this query failed with "NameError: global name 'a' is
> not defined". Now it works correctly and returns 55.
>
> --
> Best regards,
> Alexey Grishchenko
>

Hi

Any comments on this patch?

Regarding passing parameters to the Python function using globals - it was
part of the initial design of PL/Python (code
<https://github.com/postgres/postgres/blob/0bef7ba549977154572bdbf5682a32a07839fd82/src/pl/plpython/plpython.c#L783>,
documentation
<http://www.postgresql.org/docs/7.2/static/plpython-using.html>).
Originally you had to work with the "args" global list of input parameters
and were not able to access the named parameters directly, and you can
still do so even with the latest release. Moving away from global input
parameters would require switching to PyObject_CallFunctionObjArgs
<https://docs.python.org/2/c-api/object.html#c.PyObject_CallFunctionObjArgs>,
which should be possible by changing the function declaration to include
the input parameters plus "args" (for backward compatibility). However,
triggers are a bit different - they depend on modifying the global "TD"
dictionary inside the Python function, and they return only a status
string. For triggers there is no way to avoid global input parameters
without breaking backward compatibility with old end-user code.
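
For completeness, the direct-call style mentioned above would look roughly
like this sketch, assuming "func" is the compiled function object and the
arguments are already converted to Python objects (the wrapper name is
illustrative, not from the PL/Python source):

#include <Python.h>

/* Invoke the function object directly with positional arguments instead
 * of installing them into the globals dict.  The argument list is
 * terminated by NULL. */
static PyObject *
call_directly(PyObject *func, PyObject *arg1, PyObject *arg2)
{
    return PyObject_CallFunctionObjArgs(func, arg1, arg2, NULL);
}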

-- 
Best regards,
Alexey Grishchenko


Re: [HACKERS] Endless loop calling PL/Python set returning functions

2016-03-11 Thread Alexey Grishchenko
Alexey Grishchenko <agrishche...@pivotal.io> wrote:

> Tom Lane <t...@sss.pgh.pa.us> wrote:
>
>> Alexey Grishchenko <agrishche...@pivotal.io> writes:
>> > No, my fix handles this well.
>> > In fact, with the first function call you allocate global variables
>> > representing Python function input parameters, call the function and
>> > receive iterator over the function results. Then in a series of
>> > Postgres calls to PL/Python handler you just fetch next value from
>> > the iterator, you are not calling the Python function anymore. When
>> > the iterator reaches the end, PL/Python call handler deallocates the
>> > global variable representing function input parameter.
>>
>> > Regardless of the number of parallel invocations of the same
>> > function, each of them in my patch would set its own input
>> > parameters to the Python function, call the function and receive
>> > separate iterators. When the first function's result iterator would
>> > reach its end, it would deallocate the input global variable. But it
>> > won't affect other functions as they no longer need to invoke any
>> > Python code.
>>
>> Well, if you think that works, why not undo the global-dictionary
>> changes at the end of the first call, rather than later?  Then there's
>> certainly no overlap in their lifespan.
>>
>> regards, tom lane
>>
>
> Could you elaborate more on this? In general, a stack-like solution
> would work - if before the function call there is a global variable
> whose name matches an input variable name, push its value onto a stack,
> and pop it after the function execution. I will implement it tomorrow
> and see how it works.
>
>
> --
>
> Sent from handheld device
>

I have improved the code using the proposed approach. The second version
of the patch is attached.

It works in the following way - the procedure object PLyProcedure stores
information about the call stack depth (calldepth field) and the stack
itself (argstack field). When the call stack depth is zero we don't do any
additional processing, i.e. there won't be any performance impact for
existing end-user functions. Stack manipulations come into play only when
the calldepth is greater than zero, which happens either when the function
is called recursively via SPI, or when the same set-returning function is
called twice or more in a single query.
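
In rough C terms the stack manipulation could look like the following
sketch - a simplified illustration with assumed names, not the patch code:

#include <Python.h>

/* Push: before a nested call overwrites the argument names in the
 * function's globals dict, save their current values. */
static PyObject *
save_args(PyObject *globals, char **argnames, int nargs)
{
    PyObject *saved = PyList_New(0);
    int       i;

    for (i = 0; i < nargs; i++)
    {
        /* Borrowed reference; NULL if the name is not set. */
        PyObject *val = PyDict_GetItemString(globals, argnames[i]);

        /* PyList_Append takes its own reference.  A real implementation
         * must also distinguish "name absent" from "name set to None". */
        PyList_Append(saved, val ? val : Py_None);
    }
    return saved;
}

/* Pop: restore the saved values after the nested call returns. */
static void
restore_args(PyObject *globals, char **argnames, int nargs, PyObject *saved)
{
    int i;

    for (i = 0; i < nargs; i++)
        PyDict_SetItemString(globals, argnames[i], PyList_GetItem(saved, i));
    Py_DECREF(saved);
}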

Example of multiple calls to an SRF within a single query:

CREATE OR REPLACE FUNCTION func(iter int) RETURNS SETOF int AS $$
return xrange(iter)
$$ LANGUAGE plpythonu;

select func(3), func(4);


Before the patch, this query caused an endless loop finishing with OOM.
Now it works as it should.

Example of recursion with SPI:

CREATE OR REPLACE FUNCTION test(a int) RETURNS int AS $BODY$
r = 0
if a > 1:
    r = plpy.execute("SELECT test(%d) as a" % (a-1))[0]['a']
return a + r
$BODY$ LANGUAGE plpythonu;

select test(10);


Before the patch, this query failed with "NameError: global name 'a' is
not defined". Now it works correctly and returns 55.

-- 
Best regards,
Alexey Grishchenko


0002-Fix-endless-loop-in-plpython-set-returning-function.patch
Description: Binary data



Re: [HACKERS] Endless loop calling PL/Python set returning functions

2016-03-10 Thread Alexey Grishchenko
Tom Lane <t...@sss.pgh.pa.us> wrote:

> Alexey Grishchenko <agrishche...@pivotal.io> writes:
> > No, my fix handles this well.
> > In fact, with the first function call you allocate global variables
> > representing Python function input parameters, call the function and
> > receive iterator over the function results. Then in a series of
> > Postgres calls to PL/Python handler you just fetch next value from
> > the iterator, you are not calling the Python function anymore. When
> > the iterator reaches the end, PL/Python call handler deallocates the
> > global variable representing function input parameter.
>
> > Regardless of the number of parallel invocations of the same
> > function, each of them in my patch would set its own input
> > parameters to the Python function, call the function and receive
> > separate iterators. When the first function's result iterator would
> > reach its end, it would deallocate the input global variable. But it
> > won't affect other functions as they no longer need to invoke any
> > Python code.
>
> Well, if you think that works, why not undo the global-dictionary
> changes at the end of the first call, rather than later?  Then there's
> certainly no overlap in their lifespan.
>
> regards, tom lane
>

Could you elaborate more on this? In general, a stack-like solution would
work - if before the function call there is a global variable whose name
matches an input variable name, push its value onto a stack, and pop it
after the function execution. I will implement it tomorrow and see how it
works.


-- 

Sent from handheld device


Re: [HACKERS] Endless loop calling PL/Python set returning functions

2016-03-10 Thread Alexey Grishchenko
No, my fix handles this well.

In fact, with the first function call you allocate the global variables
representing the Python function's input parameters, call the function,
and receive an iterator over the function's results. Then, in a series of
Postgres calls to the PL/Python handler, you just fetch the next value
from the iterator; you are not calling the Python function anymore. When
the iterator reaches the end, the PL/Python call handler deallocates the
global variable representing the function input parameter.

Regardless of the number of parallel invocations of the same function,
each of them in my patch would set its own input parameters to the Python
function, call the function, and receive a separate iterator. When the
first function's result iterator reaches its end, it deallocates the input
global variable. But this doesn't affect the other invocations, as they no
longer need to invoke any Python code. Even if they did, they would
reallocate the global variable (it is set before each Python function
invocation). The issue was that they tried to deallocate the same global
input variable multiple times independently, which caused the error that I
fixed.

Regarding the patch for the second case with recursion - not caching
"globals" between function calls would have a performance impact, as you
would have to construct the "globals" object before each function call.
And you need globals, as it contains references to the "plpy" module, the
global variables, and the global dictionary ("GD"). I will think on this;
maybe there is a better design for this scenario. But I still think the
second scenario requires a separate patch.

On Thu, Mar 10, 2016 at 4:33 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Alexey Grishchenko <agrishche...@pivotal.io> writes:
> > One scenario where the problem occurs is when you call the same
> > set-returning function twice in a single query. This way they share
> > the same "globals", which is not a bad thing in itself, but when one
> > function finishes execution and deallocates its input parameter's
> > global, the second will fail trying to do the same. I included the fix
> > for this problem in my patch.
>
> > The second scenario where the problem occurs is when you want to call
> > the same PL/Python function recursively. For example, this code will
> > not work:
>
> Right, the recursion case is what's not being covered by this patch.
> I would rather have a single patch that deals with both of those cases,
> perhaps by *not* sharing the same dictionary across calls.  I think
> what you've done here is not so much a fix as a band-aid.  In fact,
> it doesn't even really fix the problem for the two-calls-per-query
> case does it?  It'll work if the first execution of the SRF is run to
> completion before starting the second one, but not if the two executions
> are interleaved.  I believe you can test that type of scenario with
> something like
>
>   select set_returning_function_1(...), set_returning_function_2(...);
>
> regards, tom lane
>



-- 
Best regards,
Alexey Grishchenko


Re: [HACKERS] Endless loop calling PL/Python set returning functions

2016-03-10 Thread Alexey Grishchenko
I agree that passing function parameters through globals is not the best
solution.

It works in the following way - executing custom code in Python (in our
case, the Python function invocation) is done with PyEval_EvalCode
<https://docs.python.org/2/c-api/veryhigh.html>. As an input to this C
function you specify the dictionary of globals that will be available to
the executed code. The structure PLyProcedure stores "PyObject *globals;",
which is the dictionary of globals for the specific function. So SPI works
pretty fine, as each function has a separate dictionary of globals and
they don't conflict with each other.
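
As a sketch of that mechanism (Python 2 C API; the wrapper name is mine,
not the PL/Python source), executing a code object against a per-function
globals dictionary looks like this:

#include <Python.h>

/* Run a compiled code object with the given dict as both globals and
 * locals, so names installed into the dict (such as the function's
 * input parameters) are visible to the executed code. */
static PyObject *
run_with_globals(PyCodeObject *code, PyObject *globals)
{
    /* Make the builtins available inside the executed code. */
    if (PyDict_GetItemString(globals, "__builtins__") == NULL)
        PyDict_SetItemString(globals, "__builtins__", PyEval_GetBuiltins());

    return PyEval_EvalCode(code, globals, globals);
}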

One scenario where the problem occurs is when you call the same
set-returning function twice in a single query. This way they share the
same "globals", which is not a bad thing in itself, but when one function
finishes execution and deallocates its input parameter's global, the
second will fail trying to do the same. I included the fix for this
problem in my patch.

The second scenario where the problem occurs is when you want to call the
same PL/Python function recursively. For example, this code will not work:


create or replace function test(a int) returns int as $BODY$
r = 0
if a > 1:
    r = plpy.execute("SELECT test(%d) as a" % (a-1))[0]['a']
return a + r
$BODY$ language plpythonu;

select test(10);


The function "test" has a single PLyProcedure object allocated to handle
it, thus it has a single "globals" dictionary. When internal function call
finishes, it removes the key "a" from the dictionary, and the outer
function fails with "NameError: global name 'a' is not defined" when it
tries to execute "return a + r"

But the second issue is a separate story and I think it is worth a separate
patch


On Thu, Mar 10, 2016 at 3:35 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Alexey Grishchenko <agrishche...@pivotal.io> writes:
> > There is a bug in implementation of set-returning functions in PL/Python.
> > When you call the same set-returning function twice in a single query,
> the
> > executor falls to infinite loop which causes OOM.
>
> Ugh.
>
> > Another issue with calling the same set-returning function twice in the
> > same query, is that it would delete the input parameter of the function
> > from the global variables dictionary at the end of execution. With
> calling
> > the function twice, this code attempts to delete the same entry from
> global
> > variables dict twice, thus causing KeyError. This is why the
> > function PLy_function_delete_args is modified as well to check whether
> the
> > key we intend to delete is in the globals dictionary.
>
> That whole business with putting a function's parameters into a global
> dictionary makes me itch.  Doesn't it mean problems if one plpython
> function calls another (presumably via SPI)?
>
> regards, tom lane
>



-- 
Best regards,
Alexey Grishchenko


[HACKERS] Endless loop calling PL/Python set returning functions

2016-03-10 Thread Alexey Grishchenko
Hello

There is a bug in the implementation of set-returning functions in
PL/Python. When you call the same set-returning function twice in a single
query, the executor falls into an infinite loop, which causes OOM. Here is
a simple reproduction of the issue:

CREATE OR REPLACE FUNCTION func(iter int) RETURNS SETOF int AS $$
return xrange(iter)
$$ LANGUAGE plpythonu;

select func(3), func(4);


The endless loop is caused by the fact that PL/Python uses a PLyProcedure
structure for each function, containing information specific to that
function. This structure is used to store the result set iterator returned
by the Python function call. But when we call the same function twice,
PL/Python uses the same structure for both calls, and therefore the same
result set iterator (PLyProcedure.setof), which is constantly updated by
one call after the other. When the iterator reaches the end, the first
call sets it to NULL. Then Postgres invokes the second call, which
receives the NULL iterator and calls the Python function once again,
receiving another iterator. This is an endless loop.

For set-returning functions in Postgres we should instead use the SRF_*
macros, which give us access to a per-invocation function call context
(FuncCallContext). In my implementation this context is used to store the
iterator for the function's result set, so these two calls have separate
iterators and the query succeeds.
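
For illustration, the standard SRF_* pattern looks roughly as follows.
"count_to" is a made-up example returning the integers 1..n, not the
PL/Python code; the point is that each concurrent invocation gets its own
FuncCallContext:

#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"

PG_FUNCTION_INFO_V1(count_to);

Datum
count_to(PG_FUNCTION_ARGS)
{
    FuncCallContext *funcctx;

    if (SRF_IS_FIRSTCALL())
    {
        funcctx = SRF_FIRSTCALL_INIT();
        /* Per-invocation state lives in the context; PL/Python would
         * keep the Python result iterator in funcctx->user_fctx instead
         * of in the shared PLyProcedure structure. */
        funcctx->max_calls = PG_GETARG_INT32(0);
    }

    funcctx = SRF_PERCALL_SETUP();

    if (funcctx->call_cntr < funcctx->max_calls)
        SRF_RETURN_NEXT(funcctx, Int32GetDatum((int32) funcctx->call_cntr + 1));
    else
        SRF_RETURN_DONE(funcctx);
}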

Another issue with calling the same set-returning function twice in the
same query is that each call attempts to delete the function's input
parameter from the global variables dictionary at the end of execution.
When the function is called twice, this code attempts to delete the same
entry from the globals dict twice, thus causing a KeyError. This is why
the function PLy_function_delete_args is modified as well, to check
whether the key we intend to delete is present in the globals dictionary.
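
In essence, the cleanup becomes conditional, as in this sketch (the
function name is illustrative, not taken from the patch):

#include <Python.h>

/* Remove the argument name from the globals dict only if it is still
 * present, so a second invocation that already cleaned up does not
 * trigger a KeyError. */
static void
delete_arg_if_present(PyObject *globals, const char *name)
{
    if (PyDict_GetItemString(globals, name) != NULL)
        PyDict_DelItemString(globals, name);
}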

A new regression test is included in the patch.

-- 
Best regards,
Alexey Grishchenko


0001-Fix-endless-loop-in-plpython-set-returning-function.patch
Description: Binary data
