[PERFORM] Postgres array parser
Hello, I wanted to ask about a performance problem we have run into. We make
heavy use of Postgres arrays as key->value maps, like this:

{{1,5},{2,6},{3,7}}

where 1, 2, 3 are keys and 5, 6, 7 are values. In our PL/pgSQL functions we
use a self-written array_input(array::numeric[], key::numeric) function,
which loops over the whole array and searches for the key:

FOR i IN 1 .. size LOOP
    if array[i][1] = key then
        return array[i][2];
    end if;
END LOOP;

This was a good solution until our arrays and database grew; now the FOR
loop takes a lot of time to find a value in the array.

My question is: how can this performance problem be solved? We have tried
PL/Perl for string parsing, but it takes much more time than our current
solution. We are also thinking about a self-written C++ function; maybe
someone has implemented this algorithm before?

--
Best regards

Aleksej Trofimov
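For illustration, a lookup with the function described above would be called
roughly like this (a minimal sketch; the literal array is made up):

    SELECT array_input('{{1,5},{2,6},{3,7}}'::numeric[], 2);
    -- scans the first element of each pair for key 2 and returns 6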
Re: [PERFORM] Postgres array parser
Hello

Do you know the FOREACH IN ARRAY statement in 9.1? It significantly
accelerates iteration over arrays:

http://www.depesz.com/index.php/2011/03/07/waiting-for-9-1-foreach-in-array/

2011/12/13 Aleksej Trofimov:
> My question is: how can this performance problem be solved? We have tried
> PL/Perl for string parsing, but it takes much more time than our current
> solution. We are also thinking about a self-written C++ function; maybe
> someone has implemented this algorithm before?

You can use indexes, or you can use hstore.

Regards

Pavel Stehule
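For illustration, a minimal sketch of the hstore idea Pavel mentions
(assuming the hstore contrib module is installed; the table and column
names here are invented):

    -- store the pairs as an hstore map instead of a 2-D numeric array
    CREATE TABLE readings (
        id     serial PRIMARY KEY,
        inputs hstore  -- e.g. '1=>5, 2=>6, 3=>7'
    );

    -- key lookup is a direct fetch, not a loop over the array:
    SELECT (inputs -> '2')::numeric FROM readings WHERE id = 1;

    -- a GIN index helps queries that filter on keys across many rows:
    CREATE INDEX readings_inputs_idx ON readings USING gin (inputs);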
Re: [PERFORM] Postgres array parser
We have tried the FOREACH syntax, but we have noticed a performance
degradation:

Function with FOR: 203 ms
Function with FOREACH: ~250 ms

Here is the functions' code:

CREATE OR REPLACE FUNCTION input_value_fe(in_inputs numeric[], in_input_nr numeric)
  RETURNS numeric AS
$BODY$
declare
    i numeric[];
BEGIN
    FOREACH i SLICE 1 IN ARRAY in_inputs
    LOOP
        if i[1] = in_input_nr then
            return i[2];
        end if;
    END LOOP;

    return null;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100;

CREATE OR REPLACE FUNCTION input_value(in_inputs numeric[], in_input_nr numeric)
  RETURNS numeric AS
$BODY$
declare
    size int;
BEGIN
    size = array_upper(in_inputs, 1);
    IF size IS NOT NULL THEN
        FOR i IN 1 .. size LOOP
            if in_inputs[i][1] = in_input_nr then
                return in_inputs[i][2];
            end if;
        END LOOP;
    END IF;

    return null;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100;

On 12/13/2011 04:02 PM, Pavel Stehule wrote:
> Do you know the FOREACH IN ARRAY statement in 9.1? It significantly
> accelerates iteration over arrays.
>
> You can use indexes, or you can use hstore.

--
Best regards

Aleksej Trofimov

UAB "Ruptela"

Phone: +370 657 80475
E-Mail: aleksej.trofi...@ruptela.lt
Web: http://www.ruptela.lt
Re: [PERFORM] copy vs. C function
On Mon, Dec 12, 2011 at 10:38 AM, Merlin Moncure wrote:
> On Sat, Dec 10, 2011 at 7:27 PM, Jon Nelson wrote:
>> I was experimenting with a few different methods of taking a line of
>> text, parsing it into a set of fields, and then getting that info
>> into a table.
>>
>> The first method involved writing a C program to parse a file, parse
>> the lines, and output newly-formatted lines in a format that
>> postgresql's COPY function can use. End-to-end, this takes 15 seconds
>> for about 250MB (read 250MB, parse, output new data to new file: 4
>> seconds; COPY new file: 10 seconds).
>>
>> The next approach I took was to write a C function in postgresql to
>> parse a single TEXT datum into an array of C strings, and then use
>> BuildTupleFromCStrings. There are 8 columns involved. Eliding the
>> time it takes to COPY the (raw) file into a temporary table, this
>> method took 120 seconds, give or take.
>>
>> The difference was /quite/ a surprise to me. What is the probability
>> that I am doing something very, very wrong?
>>
>> NOTE: the code that does the parsing is actually the same,
>> line-for-line; the only difference is whether the routine is called
>> by a postgresql function or by a C program via main, so obviously
>> the overhead is elsewhere.
>> NOTE #2: We are talking about approximately 2.6 million lines.
>
> Let me throw out an interesting third method I've been using to parse
> delimited text files that might be useful in your case. This is
> useful when parsing bad csv where values are not escaped, or there
> are incomplete and/or missing records, or a huge number of columns
> that you want to rotate into a more normalized structure based on
> column position.
>
> 1. Import the data into a single-column (containing the entire line)
> staging table, feeding the COPY parser a bogus delimiter.
> 2. 'Parse' the record with regexp_split_to_array (maybe in a plpgsql
> function).
> 3. Either loop over the array (in 9.1, use the FOR-IN-ARRAY
> construct), or, if you can work it into your problem, INSERT/SELECT,
> expanding the array with a trick like the one used in
> information_schema._pg_expandarray so you can hook logic on the array
> (column position).

If you replace [2] with my C function (which can process all of the
data, *postgresql overhead not included*, in about 1 second) then
that's what I did. It returns a composite type, making [3] unnecessary.

I know it's not parsing, so I started a time-honored debugging
approach: I returned early.

Is the function-call overhead that high? That's outrageously high. What
else could it be? Is returning a composite type outrageously expensive?

So here is what I did: I modified the code so that it immediately
returns NULL. Result: 2 seconds. Extract arguments, allocate temporary
work buffer: another 0.5 seconds. Add parsing: another 1.5 seconds
[total: 4.1 seconds], and so on. Two of the items require base
conversion, so calling strtol (twice) and snprintf (twice) adds 6
seconds, and formatting one of the items as an array (a strcpy and a
strcat) adds another 1.5 seconds, for a total of 11.5.

The only thing I have left are these statements:

get_call_result_type
TupleDescGetAttInMetadata
BuildTupleFromCStrings
HeapTupleGetDatum
and finally PG_RETURN_DATUM

It turns out that:
get_call_result_type adds 43 seconds [total: 54],
TupleDescGetAttInMetadata adds 19 seconds [total: 73],
BuildTupleFromCStrings accounts for 43 seconds [total: 116].

So those three functions account for 90% of the total time spent. What
alternatives exist? Do I have to call get_call_result_type /every time/
through the function?

--
Jon
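For illustration, a minimal sketch of the staging-table approach Merlin
describes above (the file path, table, and column names are all
invented, and the input is assumed to be comma-delimited):

    -- 1. land each raw line in a one-column staging table; the bogus
    --    delimiter keeps COPY from splitting the line itself
    CREATE TEMP TABLE staging (line text);
    COPY staging FROM '/tmp/input.txt' WITH DELIMITER E'\x01';

    -- 2./3. split each line and expand by column position in one pass
    INSERT INTO target (col1, col2, col3)
    SELECT f[1], f[2]::int, f[3]
    FROM (SELECT regexp_split_to_array(line, ',') AS f
          FROM staging) s;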
Re: [PERFORM] Postgres array parser
Hello

2011/12/13 Aleksej Trofimov:
> We have tried the FOREACH syntax, but we have noticed a performance
> degradation:
>
> Function with FOR: 203 ms
> Function with FOREACH: ~250 ms

It is strange; on my machine FOREACH is about 2x faster:

postgres=# select input_value(array(select
generate_series(1,100)::numeric), 10);
 input_value
-------------

(1 row)

Time: 495.426 ms

postgres=# select input_value_fe(array(select
generate_series(1,100)::numeric), 10);
 input_value_fe
----------------

(1 row)

Time: 248.980 ms

Regards

Pavel
[PERFORM] select distinct uses index scan vs full table scan
I've got a 5GB table with about 12 million rows.

Recently, I had to select the distinct values from just one column. The
planner chose an index scan. The query took almost an hour. When I
forced index scan off, the query took 90 seconds (full table scan).

The planner estimated 70,000 unique values when, in fact, there are 12
million (the value for this column is *almost* but not quite unique).
What's more, despite bumping the statistics on that column up to 1000
and re-analyzing, the planner now thinks that there are 300,000 unique
values.

How can I tell the planner that a given column is much more unique
than, apparently, it thinks it is? The column type is INET. This is on
PG 8.4.10 on Linux x86_64, with
81f4e6cd27d538bc27e9714a9173e4df353a02e5 applied.

--
Jon
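For illustration, the statistics bump described above would look
something like this (the table and column names are invented):

    ALTER TABLE conn_log ALTER COLUMN client_addr SET STATISTICS 1000;
    ANALYZE conn_log;

    -- check what the planner now believes about the column:
    SELECT n_distinct FROM pg_stats
     WHERE tablename = 'conn_log' AND attname = 'client_addr';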
Re: [PERFORM] select distinct uses index scan vs full table scan
Jon Nelson writes:
> I've got a 5GB table with about 12 million rows.
> Recently, I had to select the distinct values from just one column.
> The planner chose an index scan. The query took almost an hour.
> When I forced index scan off, the query took 90 seconds (full table
> scan).

Usually, we hear complaints about the opposite. Are you using
nondefault cost settings?

> The planner estimated 70,000 unique values when, in fact, there are
> 12 million (the value for this column is *almost* but not quite
> unique). What's more, despite bumping the statistics on that column
> up to 1000 and re-analyzing, the planner now thinks that there are
> 300,000 unique values.

Accurate ndistinct estimates are hard, but that wouldn't have much of
anything to do with this particular choice, AFAICS.

> How can I tell the planner that a given column is much more unique
> than, apparently, it thinks it is?

9.0 and up have ALTER TABLE ... ALTER COLUMN ... SET n_distinct.

regards, tom lane
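For illustration, the 9.0+ override Tom mentions looks like this (names
invented again; a negative value is interpreted as a fraction of the
row count, and the override takes effect at the next ANALYZE):

    ALTER TABLE conn_log ALTER COLUMN client_addr
        SET (n_distinct = -0.95);  -- ~95% of rows are distinct
    ANALYZE conn_log;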
Re: [PERFORM] select distinct uses index scan vs full table scan
On Tue, Dec 13, 2011 at 1:57 PM, Tom Lane wrote:
> Usually, we hear complaints about the opposite. Are you using
> nondefault cost settings?

Cost settings had not been changed until a few minutes ago, when your
response prompted me to try a few things. I ended up changing the
random_page_cost to 16.0 (from 4.0), partly because the H/W RAID I'm
using is awfully bad at random I/O. I'll experiment and keep tabs on
performance to see if this has a negative effect on other aspects.

> 9.0 and up have ALTER TABLE ... ALTER COLUMN ... SET n_distinct.

D'oh! I'm on 8.4.10+patches. This may provide the necessary push.

--
Jon
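For experiments like this, a cost setting can be tried per-session
before committing it to postgresql.conf (the query and names below are
invented):

    SET random_page_cost = 16.0;
    EXPLAIN SELECT DISTINCT client_addr FROM conn_log;
    RESET random_page_cost;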
Re: [PERFORM] Slow query after upgrade from 8.2 to 8.4
On 10/12/11 04:30, Tom Lane wrote:
> However, it's not apparent to me why you would see any difference
> between 8.2 and 8.4 on this type of query. I tried a query analogous
> to this one on both, and got identical plans. I'm guessing that your
> slowdown is due to not having updated statistics on the new
> installation, or perhaps failing to duplicate some relevant settings.

I notice he has 8.4.*8*... I wonder if he's running into the poor
estimation bug for sub-selects/semi-joins that was fixed in 8.4.9.

Kaloyan, can you try the query in 8.4.9?

regards

Mark
Re: [PERFORM] copy vs. C function
Jon Nelson writes:
> The only thing I have left are these statements:
>
> get_call_result_type
> TupleDescGetAttInMetadata
> BuildTupleFromCStrings
> HeapTupleGetDatum
> and finally PG_RETURN_DATUM
>
> It turns out that:
> get_call_result_type adds 43 seconds [total: 54],
> TupleDescGetAttInMetadata adds 19 seconds [total: 73],
> BuildTupleFromCStrings accounts for 43 seconds [total: 116].
>
> So those three functions account for 90% of the total time spent.
> What alternatives exist? Do I have to call get_call_result_type
> /every time/ through the function?

Well, if you're concerned about performance then I think you're going
about this in entirely the wrong way, because as far as I can tell from
this you're converting all the field values to text and back again.
You should be trying to keep the values in Datum format and then
invoking heap_form_tuple. And yeah, you probably could cache the type
information across calls.

regards, tom lane
Re: [PERFORM] copy vs. C function
Hi guys,

A newbie question here, since I could not figure it out on my own: I'm
using Hamachi to connect different sites into a VPN, and their
addresses always start with 5.*.*.*. The problem I'm facing is that I
cannot restrict access to that particular range only. Currently I have:

host all all 0.0.0.0/32 md5

which allows all the IPs, and the attempt:

host all all 5.0.0.0/32 md5

does not work. So what am I supposed to add in "pg_hba.conf" in order
to achieve my restriction?

Please help me,

Thank you,
Danny

From: Tom Lane
To: Jon Nelson
Cc: pgsql-performance@postgresql.org
Sent: Wednesday, December 14, 2011 8:18 AM
Subject: Re: [PERFORM] copy vs. C function

> Well, if you're concerned about performance then I think you're going
> about this in entirely the wrong way, because as far as I can tell
> from this you're converting all the field values to text and back
> again. You should be trying to keep the values in Datum format and
> then invoking heap_form_tuple. And yeah, you probably could cache the
> type information across calls.
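For what it's worth, a sketch of the usual fix (the mask length, not
the address, is the issue: /32 matches only the single address 5.0.0.0,
while a /8 mask covers all of 5.*.*.*):

    # pg_hba.conf: allow only the Hamachi 5.x.x.x range, password auth
    host    all    all    5.0.0.0/8    md5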