Re: [HACKERS] PL/Python adding support for multi-dimensional arrays
On Wed, Aug 3, 2016 at 12:49 PM, Alexey Grishchenko <agrishche...@pivotal.io> wrote:
> Hi
>
> The current implementation of PL/Python does not allow the use of
> multi-dimensional arrays, either as input or as output parameters. This
> forces end users to introduce workarounds like casting arrays to text
> before passing them to the functions and parsing them afterwards, which is
> an error-prone approach.
>
> This patch adds support for multi-dimensional arrays as both input and
> output parameters of PL/Python functions. The number of dimensions
> supported is limited by the Postgres MAXDIM macro, equal to 6 by default.
> Both input and output multi-dimensional arrays must have fixed dimension
> sizes, i.e. a 2-d array should represent an MxN matrix, a 3-d array an
> MxNxK cube, etc.
>
> This patch does not support multi-dimensional arrays of composite types,
> as composite types in Python might be represented as iterators and there
> is no obvious way to find out where the nested array stops and the
> composite type structure starts. For example, if we have a composite type
> of (int, text), we could try to return "[ [ [1,'a'], [2,'b'] ], [ [3,'c'],
> [4,'d'] ] ]", and it is hard to tell that the first two levels are lists
> while the third one represents the structure. Things get even more complex
> when you have arrays as members of a composite type. This is why I think
> this limitation is reasonable.
>
> Given the function:
>
> CREATE FUNCTION test_type_conversion_array_int4(x int4[]) RETURNS int4[]
> AS $$
> plpy.info(x, type(x))
> return x
> $$ LANGUAGE plpythonu;
>
> Before the patch:
>
> # SELECT * FROM test_type_conversion_array_int4(ARRAY[[1,2,3],[4,5,6]]);
> ERROR:  cannot convert multidimensional array to Python list
> DETAIL:  PL/Python only supports one-dimensional arrays.
> CONTEXT:  PL/Python function "test_type_conversion_array_int4"
>
> After the patch:
>
> # SELECT * FROM test_type_conversion_array_int4(ARRAY[[1,2,3],[4,5,6]]);
> INFO:  ([[1, 2, 3], [4, 5, 6]], <type 'list'>)
>  test_type_conversion_array_int4
> ---------------------------------
>  {{1,2,3},{4,5,6}}
> (1 row)
>
> --
> Best regards,
> Alexey Grishchenko

Also, this patch incorporates the fix for
https://www.postgresql.org/message-id/CAH38_tkwA5qgLV8zPN1OpPzhtkNKQb30n3xq-2NR9jUfv3qwHA%40mail.gmail.com,
as the two touch the same piece of code: array manipulation in PL/Python.

--
Best regards,
Alexey Grishchenko
[HACKERS] PL/Python adding support for multi-dimensional arrays
Hi

The current implementation of PL/Python does not allow the use of multi-dimensional arrays, either as input or as output parameters. This forces end users to introduce workarounds like casting arrays to text before passing them to the functions and parsing them afterwards, which is an error-prone approach.

This patch adds support for multi-dimensional arrays as both input and output parameters of PL/Python functions. The number of dimensions supported is limited by the Postgres MAXDIM macro, equal to 6 by default. Both input and output multi-dimensional arrays must have fixed dimension sizes, i.e. a 2-d array should represent an MxN matrix, a 3-d array an MxNxK cube, etc.

This patch does not support multi-dimensional arrays of composite types, as composite types in Python might be represented as iterators and there is no obvious way to find out where the nested array stops and the composite type structure starts. For example, if we have a composite type of (int, text), we could try to return "[ [ [1,'a'], [2,'b'] ], [ [3,'c'], [4,'d'] ] ]", and it is hard to tell that the first two levels are lists while the third one represents the structure. Things get even more complex when you have arrays as members of a composite type. This is why I think this limitation is reasonable.

Given the function:

CREATE FUNCTION test_type_conversion_array_int4(x int4[]) RETURNS int4[] AS $$
plpy.info(x, type(x))
return x
$$ LANGUAGE plpythonu;

Before the patch:

# SELECT * FROM test_type_conversion_array_int4(ARRAY[[1,2,3],[4,5,6]]);
ERROR:  cannot convert multidimensional array to Python list
DETAIL:  PL/Python only supports one-dimensional arrays.
CONTEXT:  PL/Python function "test_type_conversion_array_int4"

After the patch:

# SELECT * FROM test_type_conversion_array_int4(ARRAY[[1,2,3],[4,5,6]]);
INFO:  ([[1, 2, 3], [4, 5, 6]], <type 'list'>)
 test_type_conversion_array_int4
---------------------------------
 {{1,2,3},{4,5,6}}
(1 row)

--
Best regards,
Alexey Grishchenko

0001-PL-Python-adding-support-for-multi-dimensional-arrays.patch
Description: Binary data

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
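[Editor's illustration] The fixed-dimension-sizes rule described in the patch can be sketched in plain Python. This is not the patch's C code; the helper name `array_dims` and the error messages are made up for illustration. A nested list maps to a valid multi-dimensional array only if every sub-list at a given depth has the same length (the sketch checks sibling lengths along the first-child path only, a simplification):

```python
def array_dims(value, max_dim=6):
    """Return the dimension sizes of a nested list, raising ValueError if
    the nesting is ragged or deeper than max_dim (Postgres's MAXDIM)."""
    dims = []
    while isinstance(value, list):
        dims.append(len(value))
        if len(dims) > max_dim:
            raise ValueError("number of array dimensions exceeds the maximum")
        if value and isinstance(value[0], list):
            # every sibling at this level must be a list of the same length
            if any(not isinstance(v, list) or len(v) != len(value[0])
                   for v in value):
                raise ValueError("multidimensional arrays must have "
                                 "sub-lists with matching dimensions")
        value = value[0] if value else None
    return dims

print(array_dims([[1, 2, 3], [4, 5, 6]]))  # [2, 3]: a valid 2x3 matrix
```

Under these rules, `[[1, 2], [3]]` is rejected because the two first-level sub-lists have different lengths, mirroring the restriction the patch enforces.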
[HACKERS] Fix for PL/Python slow input arrays traversal issue
Hi

The following issue exists in PL/Python: when a function takes an array as an input parameter, processing an array of fixed-size elements that contains null values is many times slower than processing the same array without nulls. Here is an example:

-- Function
create or replace function test(a int8[]) returns int8 as $BODY$
return sum([x for x in a if x is not None])
$BODY$ language plpythonu volatile;

pl_regression=# select test(array_agg(a)::int8[])
pl_regression-# from (
pl_regression(#   select generate_series(1,10) as a
pl_regression(# ) as q;
 test
------
   55
(1 row)

Time: 22.248 ms

pl_regression=# select test(array_agg(a)::int8[])
pl_regression-# from (
pl_regression(#   select generate_series(1,10) as a
pl_regression(#   union all
pl_regression(#   select null::int8 as a
pl_regression(# ) as q;
 test
------
   55
(1 row)

Time: 7179.921 ms

As you can see, a single null in the array introduces a 320x slowdown. The reason is the following: the original implementation calls array_ref for each element of the array, and each call to array_ref causes a subsequent call to array_seek. The array_seek function has a shortcut for fixed-size arrays with no nulls; but if the array's elements are not of fixed size, or if the array contains nulls, each call to array_seek computes the offset of the Kth element by walking from the first element. This is an O(N^2) algorithm, resulting in high processing times for arrays of variable-size elements and for arrays with nulls.

The fix I propose applies the same logic used by the array_out function for efficient array traversal: it keeps a pointer to the offset of the last fetched element, which results in a dramatic performance improvement for the affected cases. With this implementation, arrays of fixed-size elements without nulls, arrays of fixed-size elements with nulls, and arrays of variable-size elements are all processed at the same speed.

Here is the test after the fix is applied:

pl_regression=# select test(array_agg(a)::int8[])
pl_regression-# from (
pl_regression(#   select generate_series(1,10) as a
pl_regression(# ) as q;
 test
------
   55
(1 row)

Time: 21.056 ms

pl_regression=# select test(array_agg(a)::int8[])
pl_regression-# from (
pl_regression(#   select generate_series(1,10) as a
pl_regression(#   union all
pl_regression(#   select null::int8 as a
pl_regression(# ) as q;
 test
------
   55
(1 row)

Time: 22.839 ms

--
Best regards,
Alexey Grishchenko

0001-Fix-for-PL-Python-slow-input-arrays-traversal-issue.patch
Description: Binary data
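[Editor's illustration] The complexity difference described above can be modeled in a few lines of plain Python. The function names are made up; this only illustrates why re-walking from the start (what array_seek does without its shortcut) is O(N^2) overall, while remembering the previous element's end (the array_out-style traversal) is O(N):

```python
def element_offsets_quadratic(sizes):
    """Compute each element's byte offset by re-walking from the start of
    the array every time, as happens when the no-nulls fixed-size shortcut
    does not apply.  Total work is O(N^2)."""
    offsets = []
    for k in range(len(sizes)):
        pos = 0
        for i in range(k):          # walk elements 0..k-1 from scratch
            pos += sizes[i]
        offsets.append(pos)
    return offsets

def element_offsets_linear(sizes):
    """Compute the same offsets while keeping a pointer to where the
    previous element ended.  Total work is O(N)."""
    offsets, pos = [], 0
    for size in sizes:
        offsets.append(pos)
        pos += size                 # advance the saved pointer by one element
    return offsets

sizes = [8, 12, 8, 4, 16]           # variable-size elements, in bytes
print(element_offsets_quadratic(sizes))  # [0, 8, 20, 28, 32]
print(element_offsets_linear(sizes))     # [0, 8, 20, 28, 32]
```

Both produce identical offsets; only the amount of work differs, which is exactly the effect visible in the 22 ms vs. 7180 ms timings above.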
Re: [HACKERS] Endless loop calling PL/Python set returning functions
Alexey Grishchenko <agrishche...@pivotal.io> wrote:
> Alexey Grishchenko <agrishche...@pivotal.io> wrote:
>
>> Tom Lane <t...@sss.pgh.pa.us> wrote:
>>
>>> Alexey Grishchenko <agrishche...@pivotal.io> writes:
>>> > No, my fix handles this well.
>>> > In fact, with the first function call you allocate the global
>>> > variables representing the Python function's input parameters, call
>>> > the function, and receive an iterator over the function's results.
>>> > Then, in a series of Postgres calls to the PL/Python handler, you
>>> > just fetch the next value from the iterator; you are not calling the
>>> > Python function anymore. When the iterator reaches the end, the
>>> > PL/Python call handler deallocates the global variable representing
>>> > the function's input parameter.
>>>
>>> > Regardless of the number of parallel invocations of the same
>>> > function, each of them in my patch would set its own input parameters
>>> > for the Python function, call the function, and receive a separate
>>> > iterator. When the first function's result iterator reaches its end,
>>> > it deallocates the input global variable. But it won't affect the
>>> > other functions, as they no longer need to invoke any Python code.
>>>
>>> Well, if you think that works, why not undo the global-dictionary
>>> changes at the end of the first call, rather than later? Then there's
>>> certainly no overlap in their lifespan.
>>>
>>> regards, tom lane
>>
>> Could you elaborate more on this? In general, a stack-like solution
>> would work: if before the function call there is a global variable with
>> a name matching the input variable's name, push its value onto the
>> stack, and pop it after the function execution. I will implement it
>> tomorrow and see how it works.
>>
>> --
>> Sent from handheld device
>
> I have improved the code using the proposed approach. The second version
> of the patch is attached.
>
> It works in the following way: the procedure object PLyProcedure stores
> information about the call stack depth (the calldepth field) and the
> stack itself (the argstack field). When the call stack depth is zero we
> don't do any additional processing, i.e. there won't be any performance
> impact on existing end-user functions. Stack manipulations come into play
> only when calldepth is greater than zero, which happens either when the
> function is called recursively through SPI, or when you call the same
> set-returning function twice or more in a single query.
>
> Example of multiple calls to an SRF within a single query:
>
> CREATE OR REPLACE FUNCTION func(iter int) RETURNS SETOF int AS $$
> return xrange(iter)
> $$ LANGUAGE plpythonu;
>
> select func(3), func(4);
>
> Before the patch this query caused an endless loop ending in OOM. Now it
> works as it should.
>
> Example of recursion through SPI:
>
> CREATE OR REPLACE FUNCTION test(a int) RETURNS int AS $BODY$
> r = 0
> if a > 1:
>     r = plpy.execute("SELECT test(%d) as a" % (a-1))[0]['a']
> return a + r
> $BODY$ LANGUAGE plpythonu;
>
> select test(10);
>
> Before the patch this query failed with "NameError: global name 'a' is
> not defined". Now it works correctly and returns 55.
>
> --
> Best regards,
> Alexey Grishchenko

Hi

Any comments on this patch?

Regarding passing parameters to the Python function using globals: it was in the initial design of PL/Python (code <https://github.com/postgres/postgres/blob/0bef7ba549977154572bdbf5682a32a07839fd82/src/pl/plpython/plpython.c#L783>, documentation <http://www.postgresql.org/docs/7.2/static/plpython-using.html>). Originally you had to work with the "args" global list of input parameters and were not able to access the named parameters directly. And you can still do so even with the latest release.

Moving away from global input parameters would require switching to PyObject_CallFunctionObjArgs <https://docs.python.org/2/c-api/object.html#c.PyObject_CallFunctionObjArgs>, which should be possible by changing the function declaration to include the input parameters plus "args" (for backward compatibility). However, triggers are a bit different: they depend on modifying the global "TD" dictionary inside the Python function, and they return only the status string. For them, there is no way to avoid global input parameters without breaking backward compatibility with old end-user code.

--
Best regards,
Alexey Grishchenko
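[Editor's illustration] The two calling conventions discussed here can be sketched in plain Python. The real implementation uses the CPython C API (PyEval_EvalCode for the globals-based style, PyObject_CallFunctionObjArgs for the argument-based style); this is only an analogy, and all names below are made up:

```python
# Style 1: inject the input parameter into the globals dict, then
# evaluate the function body, as classic PL/Python does.
proc_globals = {}
body = compile("result = a + 1", "<proc>", "exec")
proc_globals["a"] = 41            # set the input parameter as a global
exec(body, proc_globals)
print(proc_globals["result"])     # 42
del proc_globals["a"]             # "deallocate" the parameter after the call

# Style 2: a real function object that receives its parameters as
# arguments, so recursive or interleaved calls cannot clobber each other.
func = eval(compile("lambda a: a + 1", "<proc>", "eval"))
print(func(41))                   # 42
```

Style 1 is what makes recursion and repeated SRF calls fragile: there is exactly one `proc_globals["a"]` slot per function. Style 2 sidesteps that, at the cost of the backward-compatibility concerns described above.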
Re: [HACKERS] Endless loop calling PL/Python set returning functions
Alexey Grishchenko <agrishche...@pivotal.io> wrote:
> Tom Lane <t...@sss.pgh.pa.us> wrote:
>
>> Alexey Grishchenko <agrishche...@pivotal.io> writes:
>> > No, my fix handles this well.
>> > In fact, with the first function call you allocate the global
>> > variables representing the Python function's input parameters, call
>> > the function, and receive an iterator over the function's results.
>> > Then, in a series of Postgres calls to the PL/Python handler, you just
>> > fetch the next value from the iterator; you are not calling the Python
>> > function anymore. When the iterator reaches the end, the PL/Python
>> > call handler deallocates the global variable representing the
>> > function's input parameter.
>>
>> > Regardless of the number of parallel invocations of the same function,
>> > each of them in my patch would set its own input parameters for the
>> > Python function, call the function, and receive a separate iterator.
>> > When the first function's result iterator reaches its end, it
>> > deallocates the input global variable. But it won't affect the other
>> > functions, as they no longer need to invoke any Python code.
>>
>> Well, if you think that works, why not undo the global-dictionary
>> changes at the end of the first call, rather than later? Then there's
>> certainly no overlap in their lifespan.
>>
>> regards, tom lane
>
> Could you elaborate more on this? In general, a stack-like solution would
> work: if before the function call there is a global variable with a name
> matching the input variable's name, push its value onto the stack, and
> pop it after the function execution. I will implement it tomorrow and see
> how it works.
>
> --
> Sent from handheld device

I have improved the code using the proposed approach. The second version of the patch is attached.

It works in the following way: the procedure object PLyProcedure stores information about the call stack depth (the calldepth field) and the stack itself (the argstack field). When the call stack depth is zero we don't do any additional processing, i.e. there won't be any performance impact on existing end-user functions. Stack manipulations come into play only when calldepth is greater than zero, which happens either when the function is called recursively through SPI, or when you call the same set-returning function twice or more in a single query.

Example of multiple calls to an SRF within a single query:

CREATE OR REPLACE FUNCTION func(iter int) RETURNS SETOF int AS $$
return xrange(iter)
$$ LANGUAGE plpythonu;

select func(3), func(4);

Before the patch this query caused an endless loop ending in OOM. Now it works as it should.

Example of recursion through SPI:

CREATE OR REPLACE FUNCTION test(a int) RETURNS int AS $BODY$
r = 0
if a > 1:
    r = plpy.execute("SELECT test(%d) as a" % (a-1))[0]['a']
return a + r
$BODY$ LANGUAGE plpythonu;

select test(10);

Before the patch this query failed with "NameError: global name 'a' is not defined". Now it works correctly and returns 55.

--
Best regards,
Alexey Grishchenko

0002-Fix-endless-loop-in-plpython-set-returning-function.patch
Description: Binary data
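[Editor's illustration] The calldepth/argstack mechanism can be sketched as a small Python model. The real fields live in the PLyProcedure C struct; the class below, its names, and the callable-body shape are all made up for illustration. When a call is re-entered (calldepth > 0), the current input globals are saved on a stack and restored when the inner call finishes:

```python
class Procedure:
    """Simplified model of PLyProcedure with the calldepth/argstack fix."""
    def __init__(self, body, argnames):
        self.globals = {}        # the function's single globals dict
        self.body = body         # callable taking the globals dict
        self.argnames = argnames
        self.calldepth = 0
        self.argstack = []

    def call(self, *args):
        pushed = self.calldepth > 0
        if pushed:
            # re-entrant call: preserve the outer call's parameters
            self.argstack.append({n: self.globals[n] for n in self.argnames})
        for name, value in zip(self.argnames, args):
            self.globals[name] = value
        self.calldepth += 1
        try:
            return self.body(self.globals)
        finally:
            self.calldepth -= 1
            if pushed:
                self.globals.update(self.argstack.pop())

# Model of the recursive test(a) function from the example above:
def body(g):
    r = 0
    if g["a"] > 1:
        r = proc.call(g["a"] - 1)   # stands in for the recursive SPI call
    return g["a"] + r               # reads "a" AFTER the inner call returns

proc = Procedure(body, ["a"])
print(proc.call(10))  # 55
```

Without the push/pop, the inner call would overwrite the shared `g["a"]`, and the outer `g["a"] + r` would see the wrong value, which is the plain-Python analogue of the NameError described in the patch.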
Re: [HACKERS] Endless loop calling PL/Python set returning functions
Tom Lane <t...@sss.pgh.pa.us> wrote:
> Alexey Grishchenko <agrishche...@pivotal.io> writes:
> > No, my fix handles this well.
> > In fact, with the first function call you allocate the global variables
> > representing the Python function's input parameters, call the function,
> > and receive an iterator over the function's results. Then, in a series
> > of Postgres calls to the PL/Python handler, you just fetch the next
> > value from the iterator; you are not calling the Python function
> > anymore. When the iterator reaches the end, the PL/Python call handler
> > deallocates the global variable representing the function's input
> > parameter.
>
> > Regardless of the number of parallel invocations of the same function,
> > each of them in my patch would set its own input parameters for the
> > Python function, call the function, and receive a separate iterator.
> > When the first function's result iterator reaches its end, it
> > deallocates the input global variable. But it won't affect the other
> > functions, as they no longer need to invoke any Python code.
>
> Well, if you think that works, why not undo the global-dictionary changes
> at the end of the first call, rather than later? Then there's certainly
> no overlap in their lifespan.
>
> regards, tom lane

Could you elaborate more on this? In general, a stack-like solution would work: if before the function call there is a global variable with a name matching the input variable's name, push its value onto the stack, and pop it after the function execution. I will implement it tomorrow and see how it works.

--
Sent from handheld device
Re: [HACKERS] Endless loop calling PL/Python set returning functions
No, my fix handles this well.

In fact, with the first function call you allocate the global variables representing the Python function's input parameters, call the function, and receive an iterator over the function's results. Then, in a series of Postgres calls to the PL/Python handler, you just fetch the next value from the iterator; you are not calling the Python function anymore. When the iterator reaches the end, the PL/Python call handler deallocates the global variable representing the function's input parameter.

Regardless of the number of parallel invocations of the same function, each of them in my patch would set its own input parameters for the Python function, call the function, and receive a separate iterator. When the first function's result iterator reaches its end, it deallocates the input global variable. But this won't affect the other functions, as they no longer need to invoke any Python code. Even if they do, they would reallocate the global variable (it is set before each Python function invocation). The issue here was that they tried to deallocate the global input variable multiple times independently, which caused the error that I fixed.

Regarding a patch for the second case, with recursion: not caching "globals" between function calls would have a performance impact, as you would have to construct the "globals" object before each function call. And you do need globals, as it contains references to the "plpy" module, the global variables, and the global dictionary ("GD"). I will think about this; maybe there is a better design for this scenario. But I still think the second scenario requires a separate patch.

On Thu, Mar 10, 2016 at 4:33 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Alexey Grishchenko <agrishche...@pivotal.io> writes:
> > One scenario where the problem occurs is when you call the same
> > set-returning function twice in a single query. This way they share the
> > same "globals", which is not a bad thing, but when one function
> > finishes execution and deallocates the input parameter's global, the
> > second will fail trying to do the same. I included the fix for this
> > problem in my patch.
>
> > The second scenario where the problem occurs is when you want to call
> > the same PL/Python function in recursion. For example, this code will
> > not work:
>
> Right, the recursion case is what's not being covered by this patch.
> I would rather have a single patch that deals with both of those cases,
> perhaps by *not* sharing the same dictionary across calls. I think
> what you've done here is not so much a fix as a band-aid. In fact,
> it doesn't even really fix the problem for the two-calls-per-query
> case, does it? It'll work if the first execution of the SRF is run to
> completion before starting the second one, but not if the two executions
> are interleaved. I believe you can test that type of scenario with
> something like
>
> select set_returning_function_1(...), set_returning_function_2(...);
>
> regards, tom lane

--
Best regards,
Alexey Grishchenko
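[Editor's illustration] Tom's interleaving point can be modeled with per-call state in plain Python. The dict passed to each call plays the role of Postgres's FuncCallContext; the names are illustrative. The executor asks each SRF in `SELECT f1(...), f2(...)` for one row at a time, interleaved, so each evaluation needs its own iterator:

```python
def srf_call(ctx, nelems):
    """One evaluation step of a set-returning function whose iterator is
    kept in a per-call context dict rather than shared per-function state."""
    if "iter" not in ctx:
        ctx["iter"] = iter(range(nelems))   # first call: run the function
    try:
        return next(ctx["iter"])
    except StopIteration:
        return None                          # this SRF is done

ctx1, ctx2 = {}, {}                          # one private context per call
rows = []
while True:
    a, b = srf_call(ctx1, 3), srf_call(ctx2, 4)   # interleaved evaluation
    if a is None and b is None:
        break
    rows.append((a, b))
print(rows)  # [(0, 0), (1, 1), (2, 2), (None, 3)]
```

Because each context owns its iterator, the two interleaved executions finish cleanly even though one produces fewer rows than the other; with a single shared iterator, this loop would never terminate.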
Re: [HACKERS] Endless loop calling PL/Python set returning functions
I agree that passing function parameters through globals is not the best solution.

It works in the following way: executing custom code (in our case, the Python function invocation) in Python is done with PyEval_EvalCode <https://docs.python.org/2/c-api/veryhigh.html>. As an input to this C function you specify the dictionary of globals that will be available to the code. The PLyProcedure structure stores "PyObject *globals;", which is the dictionary of globals for a specific function. So SPI works pretty well here, as each function has a separate dictionary of globals and they don't conflict with each other.

One scenario where the problem occurs is when you call the same set-returning function twice in a single query. This way they share the same "globals", which is not a bad thing in itself, but when one function finishes execution and deallocates the input parameter's global, the second will fail trying to do the same. I included the fix for this problem in my patch.

The second scenario where the problem occurs is when you want to call the same PL/Python function recursively. For example, this code will not work:

create or replace function test(a int) returns int as $BODY$
r = 0
if a > 1:
    r = plpy.execute("SELECT test(%d) as a" % (a-1))[0]['a']
return a + r
$BODY$ language plpythonu;

select test(10);

The function "test" has a single PLyProcedure object allocated to handle it, and thus a single "globals" dictionary. When the inner function call finishes, it removes the key "a" from the dictionary, and the outer function fails with "NameError: global name 'a' is not defined" when it tries to execute "return a + r".

But the second issue is a separate story, and I think it is worth a separate patch.

On Thu, Mar 10, 2016 at 3:35 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Alexey Grishchenko <agrishche...@pivotal.io> writes:
> > There is a bug in the implementation of set-returning functions in
> > PL/Python. When you call the same set-returning function twice in a
> > single query, the executor falls into an infinite loop which causes
> > OOM.
>
> Ugh.
>
> > Another issue with calling the same set-returning function twice in the
> > same query is that it deletes the function's input parameter from the
> > global variables dictionary at the end of execution. When the function
> > is called twice, this code attempts to delete the same entry from the
> > global variables dict twice, thus causing a KeyError. This is why the
> > function PLy_function_delete_args is modified as well, to check whether
> > the key we intend to delete is present in the globals dictionary.
>
> That whole business with putting a function's parameters into a global
> dictionary makes me itch. Doesn't it mean problems if one plpython
> function calls another (presumably via SPI)?
>
> regards, tom lane

--
Best regards,
Alexey Grishchenko
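[Editor's illustration] The double-delete problem quoted above boils down to a few lines of plain Python. PLy_function_delete_args is the real C function being discussed; everything else here is made up to show the shape of the guarded deletion:

```python
# Two executions of the same function share one globals dict; each tries
# to remove the input parameter "iter" when it finishes.  An unguarded
# `del g["iter"]` would raise KeyError on the second removal.
func_globals = {"iter": 3}

def delete_args(g):
    # guarded deletion, mirroring the check added to PLy_function_delete_args
    if "iter" in g:
        del g["iter"]

delete_args(func_globals)   # first execution finishes, removes the key
delete_args(func_globals)   # second execution: no KeyError with the guard
```

The guard turns the second deallocation into a no-op, which is exactly the band-aid Tom is objecting to: it silences the error without giving each execution its own parameter slot.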
[HACKERS] Endless loop calling PL/Python set returning functions
Hello

There is a bug in the implementation of set-returning functions in PL/Python: when you call the same set-returning function twice in a single query, the executor falls into an infinite loop which causes OOM. Here is a simple reproduction of the issue:

CREATE OR REPLACE FUNCTION func(iter int) RETURNS SETOF int AS $$
return xrange(iter)
$$ LANGUAGE plpythonu;

select func(3), func(4);

The endless loop is caused by the fact that PL/Python uses one PLyProcedure structure per function, containing information specific to that function. This structure is used to store the result set iterator returned by the Python function call. But when we call the same function twice, PL/Python uses the same structure for both calls, and hence the same result set iterator (PLyProcedure.setof), which is constantly updated by one call after the other. When the iterator reaches the end, the first call sets it to NULL. Then Postgres invokes the second call, which sees the NULL iterator and runs the Python function once again, receiving a fresh iterator. This is an endless loop.

In fact, for set-returning functions in Postgres we should use the SRF_* macros, which give us access to a per-call context (FuncCallContext). In my implementation this context is used to store the iterator for the function's result set, so the two calls have separate iterators and the query succeeds.

Another issue with calling the same set-returning function twice in the same query is that it deletes the function's input parameter from the global variables dictionary at the end of execution. When the function is called twice, this code attempts to delete the same entry from the global variables dict twice, thus causing a KeyError. This is why the function PLy_function_delete_args is modified as well, to check whether the key we intend to delete is present in the globals dictionary.

A new regression test is included in the patch.

--
Best regards,
Alexey Grishchenko

0001-Fix-endless-loop-in-plpython-set-returning-function.patch
Description: Binary data
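[Editor's illustration] The shared-iterator mechanism described in this report can be modeled in a few lines of plain Python. This is not the actual code: the dict stands in for the per-function PLyProcedure structure, and the bounded loop stands in for the executor's endless row requests:

```python
# Shared per-function state: both calls to func() use this one slot,
# as both used PLyProcedure.setof before the fix.
proc = {"setof": None, "restarts": 0}

def srf_call_shared(nelems):
    """One evaluation step of an SRF whose iterator lives in the shared
    per-function structure instead of a per-call context."""
    if proc["setof"] is None:
        proc["restarts"] += 1               # Python function invoked again
        proc["setof"] = iter(range(nelems))
    try:
        return next(proc["setof"])
    except StopIteration:
        proc["setof"] = None                # first finisher resets the slot
        return None

# Model of `select func(3), func(4);`: the executor keeps asking both
# calls for rows until BOTH report done in the same step, which with a
# shared iterator never happens, so we cap the loop.
steps = 0
a = b = 0
while not (a is None and b is None) and steps < 1000:
    a, b = srf_call_shared(3), srf_call_shared(4)
    steps += 1
print(steps, proc["restarts"])
```

The loop hits its artificial cap: every time one call exhausts the iterator and resets it, the next request restarts the Python function, so the two calls keep refilling each other's state exactly as in the OOM scenario above.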