[PERFORM] Postgres array parser
Hello, I wanted to ask about a performance problem we have run into. We make
heavy use of Postgres arrays as key->value maps, like this:

{{1,5},{2,6},{3,7}}

where 1, 2, 3 are keys and 5, 6, 7 are values. In our PL/pgSQL functions we
use a self-written array_input(array::numeric[], key::numeric) function,
which loops over the whole array and searches for the key:

FOR i IN 1 .. size LOOP
    if array[i][1] = key then
        return array[i][2];
    end if;
END LOOP;

This was a good solution until our arrays and database grew; now the FOR
loop takes a lot of time to find a value in the array.

My question is: how can this performance problem be solved? We have tried
PL/Perl for string parsing, but it takes much more time than our current
solution. We are also thinking about a self-written C++ function; maybe
someone has implemented this algorithm before?

--
Best regards

Aleksej Trofimov
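For illustration, a lookup with the function described above would be called
roughly like this (a minimal sketch; the literal array is made up):

    SELECT array_input('{{1,5},{2,6},{3,7}}'::numeric[], 2);
    -- scans the first element of each pair for key 2 and returns 6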
Re: [PERFORM] Postgres array parser
Hello

Do you know the FOREACH IN ARRAY statement in 9.1? It significantly
accelerates iteration over arrays:

http://www.depesz.com/index.php/2011/03/07/waiting-for-9-1-foreach-in-array/

2011/12/13 Aleksej Trofimov:
> My question is: how can this performance problem be solved? We have tried
> PL/Perl for string parsing, but it takes much more time than our current
> solution. We are also thinking about a self-written C++ function; maybe
> someone has implemented this algorithm before?

You can use indexes, or you can use hstore.

Regards

Pavel Stehule
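For illustration, a minimal sketch of the hstore idea Pavel mentions
(assuming the hstore contrib module is installed; the table and column
names here are invented):

    -- store the pairs as an hstore map instead of a 2-D numeric array
    CREATE TABLE readings (
        id     serial PRIMARY KEY,
        inputs hstore  -- e.g. '1=>5, 2=>6, 3=>7'
    );

    -- key lookup is a direct fetch, not a loop over the array:
    SELECT (inputs -> '2')::numeric FROM readings WHERE id = 1;

    -- a GIN index helps queries that filter on keys across many rows:
    CREATE INDEX readings_inputs_idx ON readings USING gin (inputs);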
Re: [PERFORM] Postgres array parser
We have tried the FOREACH syntax, but we have noticed a performance
degradation:

Function with FOR: 203 ms
Function with FOREACH: ~250 ms

Here is the functions' code:

CREATE OR REPLACE FUNCTION input_value_fe(in_inputs numeric[], in_input_nr numeric)
  RETURNS numeric AS
$BODY$
declare
    i numeric[];
BEGIN
    FOREACH i SLICE 1 IN ARRAY in_inputs
    LOOP
        if i[1] = in_input_nr then
            return i[2];
        end if;
    END LOOP;

    return null;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100;

CREATE OR REPLACE FUNCTION input_value(in_inputs numeric[], in_input_nr numeric)
  RETURNS numeric AS
$BODY$
declare
    size int;
BEGIN
    size = array_upper(in_inputs, 1);
    IF size IS NOT NULL THEN
        FOR i IN 1 .. size LOOP
            if in_inputs[i][1] = in_input_nr then
                return in_inputs[i][2];
            end if;
        END LOOP;
    END IF;

    return null;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100;

On 12/13/2011 04:02 PM, Pavel Stehule wrote:
> Do you know the FOREACH IN ARRAY statement in 9.1? It significantly
> accelerates iteration over arrays.
>
> You can use indexes, or you can use hstore.

--
Best regards

Aleksej Trofimov

UAB "Ruptela"

Phone: +370 657 80475
E-Mail: aleksej.trofi...@ruptela.lt
Web: http://www.ruptela.lt
Re: [PERFORM] copy vs. C function
On Mon, Dec 12, 2011 at 10:38 AM, Merlin Moncure wrote:
> On Sat, Dec 10, 2011 at 7:27 PM, Jon Nelson wrote:
>> I was experimenting with a few different methods of taking a line of
>> text, parsing it into a set of fields, and then getting that info
>> into a table.
>>
>> The first method involved writing a C program to parse a file, parse
>> the lines, and output newly-formatted lines in a format that
>> postgresql's COPY function can use. End-to-end, this takes 15 seconds
>> for about 250MB (read 250MB, parse, output new data to new file: 4
>> seconds; COPY new file: 10 seconds).
>>
>> The next approach I took was to write a C function in postgresql to
>> parse a single TEXT datum into an array of C strings, and then use
>> BuildTupleFromCStrings. There are 8 columns involved. Eliding the
>> time it takes to COPY the (raw) file into a temporary table, this
>> method took 120 seconds, give or take.
>>
>> The difference was /quite/ a surprise to me. What is the probability
>> that I am doing something very, very wrong?
>>
>> NOTE: the code that does the parsing is actually the same,
>> line-for-line; the only difference is whether the routine is called
>> by a postgresql function or by a C program via main, so obviously
>> the overhead is elsewhere.
>> NOTE #2: We are talking about approximately 2.6 million lines.
>
> Let me throw out an interesting third method I've been using to parse
> delimited text files that might be useful in your case. This is
> useful when parsing bad csv where values are not escaped, or there
> are incomplete and/or missing records, or a huge number of columns
> that you want to rotate into a more normalized structure based on
> column position.
>
> 1. Import the data into a single-column (containing the entire line)
> staging table, feeding the COPY parser a bogus delimiter.
> 2. 'Parse' the record with regexp_split_to_array (maybe in a plpgsql
> function).
> 3. Either loop over the array (in 9.1, use the FOR-IN-ARRAY
> construct), or, if you can work it into your problem, INSERT/SELECT,
> expanding the array with a trick like the one used in
> information_schema._pg_expandarray so you can hook logic on the array
> (column position).

If you replace [2] with my C function (which can process all of the
data, *postgresql overhead not included*, in about 1 second) then
that's what I did. It returns a composite type, making [3] unnecessary.

I know it's not parsing, so I started a time-honored debugging
approach: I returned early.

Is the function-call overhead that high? That's outrageously high. What
else could it be? Is returning a composite type outrageously expensive?

So here is what I did: I modified the code so that it immediately
returns NULL. Result: 2 seconds. Extract arguments, allocate temporary
work buffer: another 0.5 seconds. Add parsing: another 1.5 seconds
[total: 4.1 seconds], and so on. Two of the items require base
conversion, so calling strtol (twice) and snprintf (twice) adds 6
seconds, and formatting one of the items as an array (a strcpy and a
strcat) adds another 1.5 seconds, for a total of 11.5.

The only thing I have left are these statements:

get_call_result_type
TupleDescGetAttInMetadata
BuildTupleFromCStrings
HeapTupleGetDatum
and finally PG_RETURN_DATUM

It turns out that:
get_call_result_type adds 43 seconds [total: 54],
TupleDescGetAttInMetadata adds 19 seconds [total: 73],
BuildTupleFromCStrings accounts for 43 seconds [total: 116].

So those three functions account for 90% of the total time spent. What
alternatives exist? Do I have to call get_call_result_type /every time/
through the function?

--
Jon
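For illustration, a minimal sketch of the staging-table approach Merlin
describes above (the file path, table, and column names are all
invented, and the input is assumed to be comma-delimited):

    -- 1. land each raw line in a one-column staging table; the bogus
    --    delimiter keeps COPY from splitting the line itself
    CREATE TEMP TABLE staging (line text);
    COPY staging FROM '/tmp/input.txt' WITH DELIMITER E'\x01';

    -- 2./3. split each line and expand by column position in one pass
    INSERT INTO target (col1, col2, col3)
    SELECT f[1], f[2]::int, f[3]
    FROM (SELECT regexp_split_to_array(line, ',') AS f
          FROM staging) s;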
Re: [PERFORM] Postgres array parser
Hello

2011/12/13 Aleksej Trofimov:
> We have tried the FOREACH syntax, but we have noticed a performance
> degradation:
>
> Function with FOR: 203 ms
> Function with FOREACH: ~250 ms

It is strange; on my machine FOREACH is about 2x faster:

postgres=# select input_value(array(select
generate_series(1,100)::numeric), 10);
 input_value
-------------

(1 row)

Time: 495.426 ms

postgres=# select input_value_fe(array(select
generate_series(1,100)::numeric), 10);
 input_value_fe
----------------

(1 row)

Time: 248.980 ms

Regards

Pavel
[PERFORM] select distinct uses index scan vs full table scan
I've got a 5GB table with about 12 million rows.

Recently, I had to select the distinct values from just one column. The
planner chose an index scan. The query took almost an hour. When I
forced index scan off, the query took 90 seconds (full table scan).

The planner estimated 70,000 unique values when, in fact, there are 12
million (the value for this column is *almost* but not quite unique).
What's more, despite bumping the statistics on that column up to 1000
and re-analyzing, the planner now thinks that there are 300,000 unique
values.

How can I tell the planner that a given column is much more unique
than, apparently, it thinks it is? The column type is INET. This is on
PG 8.4.10 on Linux x86_64, with
81f4e6cd27d538bc27e9714a9173e4df353a02e5 applied.

--
Jon
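For illustration, the statistics bump described above would look
something like this (the table and column names are invented):

    ALTER TABLE conn_log ALTER COLUMN client_addr SET STATISTICS 1000;
    ANALYZE conn_log;

    -- check what the planner now believes about the column:
    SELECT n_distinct FROM pg_stats
     WHERE tablename = 'conn_log' AND attname = 'client_addr';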
Re: [PERFORM] select distinct uses index scan vs full table scan
Jon Nelson writes:
> I've got a 5GB table with about 12 million rows.
> Recently, I had to select the distinct values from just one column.
> The planner chose an index scan. The query took almost an hour.
> When I forced index scan off, the query took 90 seconds (full table
> scan).

Usually, we hear complaints about the opposite. Are you using
nondefault cost settings?

> The planner estimated 70,000 unique values when, in fact, there are
> 12 million (the value for this column is *almost* but not quite
> unique). What's more, despite bumping the statistics on that column
> up to 1000 and re-analyzing, the planner now thinks that there are
> 300,000 unique values.

Accurate ndistinct estimates are hard, but that wouldn't have much of
anything to do with this particular choice, AFAICS.

> How can I tell the planner that a given column is much more unique
> than, apparently, it thinks it is?

9.0 and up have ALTER TABLE ... ALTER COLUMN ... SET n_distinct.

regards, tom lane
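For illustration, the 9.0+ override Tom mentions looks like this (names
invented again; a negative value is interpreted as a fraction of the
row count, and the override takes effect at the next ANALYZE):

    ALTER TABLE conn_log ALTER COLUMN client_addr
        SET (n_distinct = -0.95);  -- ~95% of rows are distinct
    ANALYZE conn_log;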
Re: [PERFORM] select distinct uses index scan vs full table scan
On Tue, Dec 13, 2011 at 1:57 PM, Tom Lane wrote:
> Usually, we hear complaints about the opposite. Are you using
> nondefault cost settings?

Cost settings had not been changed until a few minutes ago, when your
response prompted me to try a few things. I ended up changing the
random_page_cost to 16.0 (from 4.0), partly because the H/W RAID I'm
using is awfully bad at random I/O. I'll experiment and keep tabs on
performance to see if this has a negative effect on other aspects.

> 9.0 and up have ALTER TABLE ... ALTER COLUMN ... SET n_distinct.

D'oh! I'm on 8.4.10+patches. This may provide the necessary push.

--
Jon
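For experiments like this, a cost setting can be tried per-session
before committing it to postgresql.conf (the query and names below are
invented):

    SET random_page_cost = 16.0;
    EXPLAIN SELECT DISTINCT client_addr FROM conn_log;
    RESET random_page_cost;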
Re: [PERFORM] Slow query after upgrade from 8.2 to 8.4
On 10/12/11 04:30, Tom Lane wrote:
> However, it's not apparent to me why you would see any difference
> between 8.2 and 8.4 on this type of query. I tried a query analogous
> to this one on both, and got identical plans. I'm guessing that your
> slowdown is due to not having updated statistics on the new
> installation, or perhaps failing to duplicate some relevant settings.

I notice he has 8.4.*8*... I wonder if he's running into the poor
estimation bug for sub-selects/semi-joins that was fixed in 8.4.9.

Kaloyan, can you try the query in 8.4.9?

regards

Mark
Re: [PERFORM] copy vs. C function
Jon Nelson writes:
> The only thing I have left are these statements:
>
> get_call_result_type
> TupleDescGetAttInMetadata
> BuildTupleFromCStrings
> HeapTupleGetDatum
> and finally PG_RETURN_DATUM
>
> It turns out that:
> get_call_result_type adds 43 seconds [total: 54],
> TupleDescGetAttInMetadata adds 19 seconds [total: 73],
> BuildTupleFromCStrings accounts for 43 seconds [total: 116].
>
> So those three functions account for 90% of the total time spent.
> What alternatives exist? Do I have to call get_call_result_type
> /every time/ through the function?

Well, if you're concerned about performance then I think you're going
about this in entirely the wrong way, because as far as I can tell from
this you're converting all the field values to text and back again.
You should be trying to keep the values in Datum format and then
invoking heap_form_tuple. And yeah, you probably could cache the type
information across calls.

regards, tom lane
Re: [PERFORM] copy vs. C function
Hi guys,

A newbie question here, since I could not figure it out on my own: I'm
using Hamachi to connect different sites into a VPN, and their
addresses always start with 5.*.*.*. The problem I'm facing is that I
cannot restrict access to that particular range only. Currently I have:

host all all 0.0.0.0/32 md5

which allows all the IPs, and the attempt:

host all all 5.0.0.0/32 md5

does not work. So what am I supposed to add in "pg_hba.conf" in order
to achieve my restriction?

Please help me,

Thank you,
Danny

From: Tom Lane
To: Jon Nelson
Cc: pgsql-performance@postgresql.org
Sent: Wednesday, December 14, 2011 8:18 AM
Subject: Re: [PERFORM] copy vs. C function

> Well, if you're concerned about performance then I think you're going
> about this in entirely the wrong way, because as far as I can tell
> from this you're converting all the field values to text and back
> again. You should be trying to keep the values in Datum format and
> then invoking heap_form_tuple. And yeah, you probably could cache the
> type information across calls.
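For what it's worth, a sketch of the usual fix (the mask length, not
the address, is the issue: /32 matches only the single address 5.0.0.0,
while a /8 mask covers all of 5.*.*.*):

    # pg_hba.conf: allow only the Hamachi 5.x.x.x range, password auth
    host    all    all    5.0.0.0/8    md5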