Re: [PERFORM] Queries with conditions using bitand operator

2010-07-14 Thread valgog
One of the possibilities would be to decompose your bitmap into an
array of base integers and then create a GIN (or GiST) index on that
array (intarray contrib package). This would make sense if your
articles are distributed relatively evenly and if you do not run big
ORDER BY ... LIMIT/OFFSET queries, which usually have to sort the
results fetched from the GIN index.
As you are probably also doing some tsearch queries on the articles,
you can actually build a combined (tsvector, intarray) GIN/GiST index
to optimize your searches.
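
For illustration, a minimal sketch of such a combined index; the
tsvector column fulltext_tsv is hypothetical, and multicolumn GIN
indexes need 8.4 or later plus the intarray opclass:

CREATE INDEX idx_article_fulltext_status
  ON article
  USING gin ( fulltext_tsv, (utils.bitmap_to_position_intarray(STATUS)) gin__int_ops );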

A simple function that converts your bitmap integer into an array of
bit positions could look like this:

-- DROP FUNCTION utils.bitmap_to_position_intarray(bitmap integer);

CREATE OR REPLACE FUNCTION utils.bitmap_to_position_intarray(bitmap integer)
  RETURNS integer[] AS
$BODY$
-- test
-- select utils.bitmap_to_position_intarray(5);
-- test performance
-- select utils.bitmap_to_position_intarray(s.i) from generate_series(1, 1) as s(i);
--

SELECT ARRAY(
  SELECT s.i + 1 -- +1 makes the position of the first bit 1
    FROM generate_series(0, 31) as s(i)
   WHERE ( $1 & ( 1 << s.i ) ) <> 0
  );
$BODY$
  LANGUAGE SQL IMMUTABLE STRICT;

You can create a GIN index directly over this function applied to your
bitmap field; using array set operations will then let the planner
use the GIN index (more information about these indexes here:
http://www.postgresql.org/docs/8.4/interactive/textsearch-indexes.html):

CREATE INDEX idx_article_status_gin ON article USING
gin( (utils.bitmap_to_position_intarray(STATUS) ) );

and then you can do:

SELECT * FROM article WHERE utils.bitmap_to_position_intarray(STATUS)
 && ARRAY[1,5];

or

SELECT * FROM article WHERE utils.bitmap_to_position_intarray(STATUS)
 && utils.bitmap_to_position_intarray(5);

Have a look at the possible array set operations at
http://www.postgresql.org/docs/8.4/interactive/intarray.html.

Otherwise, the solution from Joe Conway of creating a separate index
for each bit is also worth a look. It does have the drawback that you
cannot look up combinations of bits efficiently. As an advantage of
Joe's example, you can efficiently do ORDER BY article (or any other
field that you add into these limited indexes).
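
A hedged sketch of that per-bit variant (index names and the ordering
column id are assumptions):

CREATE INDEX idx_article_bit0 ON article (id) WHERE (STATUS & 1) <> 0;
CREATE INDEX idx_article_bit1 ON article (id) WHERE (STATUS & 2) <> 0;

-- a query that repeats the predicate verbatim can use the matching
-- index and read the rows already ordered:
SELECT * FROM article WHERE (STATUS & 1) <> 0 ORDER BY id LIMIT 20;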


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Full text search with ORDER BY performance issue

2009-07-21 Thread valgog
On Jul 21, 6:06 am, scott.marl...@gmail.com (Scott Marlowe) wrote:
 On Mon, Jul 20, 2009 at 9:35 PM, Krade kr...@krade.com wrote:
  But I think I might just do:
  select * from a where comment_tsv @@ plainto_tsquery('query') and timestamp >
  cast(floor(extract(epoch from CURRENT_TIMESTAMP) - 864000) as integer)
  order by timestamp desc limit 24 offset 0;

  And if I get less than 24 rows, issue the regular query:

  select * from a where comment_tsv @@ plainto_tsquery('query') order by
  timestamp desc limit 24 offset 0;

 Couldn't you do the second query as a WITH query then run another
 query to limit that result to everything greater than now()-xdays ?
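
 A minimal sketch of that WITH idea (needs 8.4; table and predicate
 are taken from the quoted queries, so only illustrative):

 WITH matches AS (
     SELECT * FROM a WHERE comment_tsv @@ plainto_tsquery('query')
 )
 SELECT * FROM matches
  WHERE timestamp > cast(floor(extract(epoch from CURRENT_TIMESTAMP) - 864000) as integer)
  ORDER BY timestamp DESC LIMIT 24 OFFSET 0;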

 --
 Sent via pgsql-performance mailing list (pgsql-performa...@postgresql.org)
 To make changes to your 
 subscription:http://www.postgresql.org/mailpref/pgsql-performance

Hi,

There is a problem with GIN and GiST indexes: they cannot be used by
ORDER BY. Maybe it would be a nice idea to ask Oleg to make it
possible to use b-tree columns in GiST or GIN to make the sort
easier, but I have no idea how difficult it would be to implement in
the current GIN or GiST structures. I think Oleg or even Tom would be
the right people to ask :) But even if it is possible, it will not be
implemented until at least 8.5, which will need a year to come, so
until then...

It is possible to split your table into several smaller ones, put
them on different machines, and then split your query with DBLINK.
This will distribute the burden of sorting across several machines,
each of which has to sort a smaller part. After you have your 25 ids
from each of the machines, you can merge them, sort again and limit as
you wish. Doing large offsets will still be problematic, but faster
anyway in most reasonable offset ranges. (Load balancing tools like
pg_pool can automate this task, but I do not have practical experience
using them for that purpose.)
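
A hedged sketch of such a DBLINK merge (connection strings, shard
layout and the column list are all assumptions):

SELECT id, ts
  FROM ( SELECT * FROM dblink('host=shard1 dbname=app',
              'SELECT id, timestamp FROM a
                WHERE comment_tsv @@ plainto_tsquery(''query'')
                ORDER BY timestamp DESC LIMIT 25')
              AS s1(id integer, ts timestamp)
         UNION ALL
         SELECT * FROM dblink('host=shard2 dbname=app',
              'SELECT id, timestamp FROM a
                WHERE comment_tsv @@ plainto_tsquery(''query'')
                ORDER BY timestamp DESC LIMIT 25')
              AS s2(id integer, ts timestamp)
       ) merged
 ORDER BY ts DESC
 LIMIT 24;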

Yet another very interesting technology -- Sphinx search
(http://www.sphinxsearch.com/). It can distribute data over several
machines automatically, but it will probably be too expensive to start
using (if your task is not your main one :)), as they do not have
standard automation scripts, it does not support live updates (so you
will always have some minutes of delay), and it is a standalone
service that needs to be maintained, configured and synchronized with
your main database separately (though you can use pl/python to access
it from postgres).

Good luck with your task :)

-- Valentine Gogichashvili

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Terrible Write Performance of a Stored Procedure

2009-06-30 Thread valgog
On Jun 26, 9:30 pm, goofyheadedp...@gmail.com (Brian Troutwine) wrote:
 Hello, all.

  CREATE OR REPLACE FUNCTION item_data_insert(
         iasin TEXT, iauthor TEXT, ibinding TEXT, icurrency_code TEXT,
         iisbn TEXT, iheight INTEGER, iwidth INTEGER, ilength INTEGER,
 iweight INTEGER,
         ilist_price INTEGER, iproduct_group TEXT, isales_rank INTEGER,
         ititle TEXT, ioffer_pages INTEGER, ioffers TEXT)
  RETURNS VOID AS
  $$
  DECLARE
          y  integer[];
  BEGIN
          y[1] := iwidth;
          y[2] := ilength;
          y[3] := iheight;
          y[4] := iweight;
  BEGIN
          INSERT INTO item_details
                  (isbn, title, author, binding, list_price, dimensions)
                  VALUES
                  (iisbn, ititle, iauthor, ibinding, ilist_price, y);
          EXCEPTION WHEN unique_violation THEN
                  UPDATE item_details SET
                         title = ititle,
                         author = iauthor,
                         binding = ibinding,
                         list_price = ilist_price,
                         dimensions = y
                  WHERE isbn = iisbn;
          END;
          BEGIN
                  INSERT INTO amazon_items
                  (asin, sales_rank, offers, offer_pages, isbn)
                  VALUES
                  (iasin, isales_rank, crunch(ioffers), ioffer_pages, iisbn);
          EXCEPTION WHEN unique_violation THEN
                  IF isales_rank IS NOT NULL THEN
                     UPDATE amazon_items SET
                         sales_rank = isales_rank
                     WHERE asin = iasin;
                  END IF;
                  IF ioffers IS NOT NULL THEN
                     UPDATE amazon_items SET
                            offers = crunch(ioffers),
                            offers_last_updated = CURRENT_TIMESTAMP,
                            offer_pages = ioffer_pages
                     WHERE asin = iasin;
                  END IF;
          END;
  END;
  $$
  LANGUAGE plpgsql;


Hi, did the index on the isbn field help?

Another note, which is more fine tuning actually than the real cause
of the slow execution of your procedure: if you expect to update more
than you insert, then you probably should not wait for the exception
to be thrown, as BEGIN ... EXCEPTION ... END blocks are more expensive
to execute than simple calls. Have a look here:
http://www.postgresql.org/docs/8.3/interactive/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING
Also note that if you UPDATE first, and then try to INSERT only when
the UPDATE could not find anything to update, you really HAVE to
expect the INSERT to fail and then retry the update, as another,
parallel transaction could be fast enough to INSERT a record after you
tried to update and before your transaction starts to insert.
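
A minimal sketch of that update-first pattern with a retry loop,
modeled on the merge_db example in the PostgreSQL documentation (the
reduced column list here is only for illustration):

CREATE OR REPLACE FUNCTION item_details_merge(iisbn TEXT, ititle TEXT)
RETURNS VOID AS
$$
BEGIN
    LOOP
        UPDATE item_details SET title = ititle WHERE isbn = iisbn;
        IF found THEN
            RETURN;
        END IF;
        BEGIN
            INSERT INTO item_details (isbn, title) VALUES (iisbn, ititle);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- a parallel transaction inserted the row between our
            -- UPDATE and INSERT: loop back and try the UPDATE again
        END;
    END LOOP;
END;
$$
LANGUAGE plpgsql;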

With best regards,

-- Valentine Gogichashvili

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance with temporary table

2008-04-10 Thread valgog

I see, I am having practically the same problem... utilizing the
partitioning idea
http://www.postgresql.org/docs/8.3/interactive/ddl-partitioning.html
via table inheritance.

I have prepared a post with some trigger and rule examples for you:
http://valgogtech.blogspot.com/2008/04/table-partitioning-automation-triggers.html
I hope you will find it useful if you are not doing it already
yourself :-).

About the use of the temporary table I would say that you could
actually try to add a special row status flag column (I use char for
such flags) to your partitioned tables to mark some rows as unused,
and then create some conditional indexes that consider this flag for
your data operations... This would make it possible for you to avoid
creating temporary tables, I hope...
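
A hedged sketch of that flag-column idea (all names hypothetical,
columns borrowed from the thread):

ALTER TABLE your_partition ADD COLUMN row_status char(1) NOT NULL DEFAULT 'A';

CREATE INDEX idx_your_partition_active
    ON your_partition (lane_id, measurement_start)
 WHERE row_status = 'A';  -- the conditional index sees only "active" rows

-- instead of copying rows into a temporary table, just flag them:
UPDATE your_partition SET row_status = 'U' WHERE lane_id = 1317;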

With best regards,

-- Valentine

On Apr 10, 1:33 am, [EMAIL PROTECTED] (samantha mahindrakar)
wrote:
 The partitions are used to separate the data according to months. I
 have run a query to find bad data from each such partition. The
 imputation algorithm that I use requires data from 10 previous weeks
 in order to impute the data. This historical data I store in a
 temporary table, then I query this data so that I can take an average
 of all the historical data. Before taking the average some
 computations are performed. Since I need the historical data for
 every minute of data that I need to impute, I have to store the data
 in some intermediate table. Hence the temporary table.
 Now I changed the code to use a permanent table that is truncated
 after one set of data is imputed.
 I hope this makes sense.

 Samantha


 On Wed, Apr 9, 2008 at 6:44 AM, valgog [EMAIL PROTECTED] wrote:
  On Apr 7, 8:27 pm, [EMAIL PROTECTED] (samantha mahindrakar)
   wrote:

   Hi
I have written a program that imputes (or rather corrects) data within
my database.
I am using a temporary table into which I put data from another
partitioned table. I then query this table to get the desired data.
But the thing is this temporary table has to be created for every
record that I need to correct, and there are thousands of such records
that need to be corrected.
So the program necessarily creates a temporary table every time it has
to correct a record. However this table is dropped after each record
is corrected.
The program works fine... but it runs for a very long time... or it
runs for days.
I am particularly finding that it takes more time during this statement:

NOTICE:  theQuery in createtablevolumelaneshist CREATE TEMPORARY TABLE
predictiontable(lane_id, measurement_start, speed, volume, occupancy) AS
SELECT lane_id, measurement_start, speed, volume, occupancy
FROM samantha.lane_data_I_495 WHERE
lane_id IN (1317) AND
measurement_start BETWEEN '2007-11-18 09:25:00' AND '2007-11-19 01:39:06'

I am not sure if I can use a cursor to replicate the functionality of
the temp table. Is the performance bad because of the creation and
deletion of the temp table?

Thanks
Samantha

--
Sent via pgsql-performance mailing list ([EMAIL PROTECTED])
To make changes to your 
  subscription:http://www.postgresql.org/mailpref/pgsql-performance

   And why do you copy data from the partition tables? Did you try to
   manipulate data directly in the needed tables? Or are you
   aggregating some of the data there? How is the partitioning
   actually designed? Do you use table inheritance?

   -- Valentine



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance with temporary table

2008-04-09 Thread valgog
On Apr 7, 8:27 pm, [EMAIL PROTECTED] (samantha mahindrakar)
wrote:
 Hi
 I have written a program that imputes (or rather corrects) data within
 my database.
 I am using a temporary table into which I put data from another
 partitioned table. I then query this table to get the desired data.
 But the thing is this temporary table has to be created for every
 record that I need to correct, and there are thousands of such records
 that need to be corrected.
 So the program necessarily creates a temporary table every time it has
 to correct a record. However this table is dropped after each record
 is corrected.
 The program works fine... but it runs for a very long time... or it
 runs for days.
 I am particularly finding that it takes more time during this statement:

 NOTICE:  theQuery in createtablevolumelaneshist CREATE TEMPORARY TABLE
 predictiontable(lane_id, measurement_start, speed, volume, occupancy) AS
 SELECT lane_id, measurement_start, speed, volume, occupancy
 FROM samantha.lane_data_I_495 WHERE
 lane_id IN (1317) AND
 measurement_start BETWEEN '2007-11-18 09:25:00' AND '2007-11-19 01:39:06'

 I am not sure if I can use a cursor to replicate the functionality of
 the temp table. Is the performance bad because of the creation and
 deletion of the temp table?

 Thanks
 Samantha

 --
 Sent via pgsql-performance mailing list ([EMAIL PROTECTED])
 To make changes to your 
 subscription:http://www.postgresql.org/mailpref/pgsql-performance

And why do you copy data from the partition tables? Did you try to
manipulate data directly in the needed tables? Or are you aggregating
some of the data there? How is the partitioning actually designed? Do
you use table inheritance?

-- Valentine

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] response time when querying via JDBC and via psql differs

2008-02-26 Thread valgog
Do not use the setString() method to pass the parameter to the
PreparedStatement in JDBC. Construct the SQL query string as you write
it here and query the database with this new SQL string. This will
make the planner recreate a plan for every new SQL string per session
(which is usually not good), but it will let the planner choose a
correct plan.
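
The effect can be reproduced in plain SQL, since a server-side
prepared statement is planned without knowing the parameter value,
just like the JDBC setString() case; a sketch against the table from
the mail below:

PREPARE find_doc(text) AS
    SELECT id, href FROM documentfile
     WHERE lower(href) LIKE $1 ESCAPE '!'
     ORDER BY id LIMIT 1;

EXPLAIN EXECUTE find_doc('aba001!_223581.djvu');
-- planned without the value: cannot use the LIKE index

-- with the literal inlined the varchar_pattern_ops index can be used:
EXPLAIN SELECT id, href FROM documentfile
 WHERE lower(href) LIKE 'aba001!_223581.djvu' ESCAPE '!'
 ORDER BY id LIMIT 1;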

-- Valentine Gogichashvili

On Feb 25, 11:06 am, [EMAIL PROTECTED] (Pavel Rotek) wrote:
 Hi all,

   i have strange problem with performance in PostgreSQL (8.1.9). My problem
 shortly:

   I'm using PostgreSQL via the JDBC driver (postgresql-8.1-404.jdbc3.jar) and
  asking the database to search a table with approximately 3 000 000
  records.
   I have created a functional index table(lower(href) varchar_pattern_ops)
  because of lower-case LIKE searching. When I ask the database directly
  from psql, it returns the result in 0.5 ms, but when I put the same command
  via the JDBC driver, it returns in 10 000 ms. Where can the problem be? Any
  problem with PostgreSQL tuning?

 The command is
 select df.id as id, df.c as c, df.href as href, df.existing as existing,
 df.filesize as filesize from documentfile df where (lower(href) like
 'aba001!_223581.djvu' escape '!' ) order by  id limit 1   Thank you very
 much for any help,

   Kind regards,

   Pavel Rotek


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Index usage when bitwise operator is used

2007-09-18 Thread valgog
Hi Tom,

do you think it would be a good idea to ask the GIN index team to
implement an int-based bitmap set indexing operator for GIN/GiST based
indexes? Or will there be a possibility to somehow optimally index
arrays of enumerations to implement such bitmap structures in 8.3 or
later PostgreSQL versions?

With best regards,

-- Valentine

On Sep 17, 3:37 pm, [EMAIL PROTECTED] (Tom Lane) wrote:
 Kevin Grittner [EMAIL PROTECTED] writes:
  On Mon, Sep 17, 2007 at 2:49 AM, in message
  [EMAIL PROTECTED], valgog
  [EMAIL PROTECTED] wrote:
  Are you sure you understood what was the question?

  Is TBL1.CATEGORY = TBL2.CATEGORY the same as TBL1.CATEGORY &
  TBL2.CATEGORY <> 0?
  Yes, given that he stipulated that one and only one bit would be set.

 Really?  In that case, isn't this bit-field just a bad implementation of
 an enum-style field?

 regards, tom lane

 ---(end of broadcast)---
 TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Index usage when bitwise operator is used

2007-09-17 Thread valgog
Hi,

I could not find a normal solution for that issue, but I am using
some workarounds.

The solution that I am using now is to create an index for every bit
of your bitmap field.

So something like

CREATE INDEX idx_hobbybit_0_limited
  ON versionA.user_fast_index
  USING btree
  (gender, dateofbirth) -- gender and dateofbirth are the fields that
                        -- we usually ORDER BY in the select statements,
                        -- but you can play with the needed fields
  WHERE (hobby_bitmap & 1) <> 0;

By creating such an index for every used bit and combining
WHERE (hobby_bitmap & 1) <> 0 style predicates, the planner will
choose the right index to use.

Another workaround, that will be more applicable in your case I think,
is to create a functional GIN index on your bitmap field using a
static function to create an array of bitmap keys from your bitmap
field.

CREATE OR REPLACE FUNCTION versionA.bitmap_to_bit_array(source_bitmap integer)
  RETURNS integer[] AS
'select ARRAY( select ( 1 << s.i ) from generate_series(0, 32) as s(i)
where ( ( 1 << s.i ) & $1 ) <> 0 )'
  LANGUAGE 'sql' IMMUTABLE STRICT;

And then create a GIN index on the needed field using this stored
procedure. After that, it is possible to use the intarray set
operators on the result of that function, which also makes it
possible to use that GIN index.
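
A hedged sketch of that functional GIN index and a matching query
(requires the intarray module):

CREATE INDEX idx_user_fast_index_hobby_gin
    ON versionA.user_fast_index
 USING gin ( (versionA.bitmap_to_bit_array(hobby_bitmap)) gin__int_ops );

-- && is the intarray overlap operator; bitmap_to_bit_array(5) = {1,4},
-- so this finds rows having bit 0 or bit 2 set:
SELECT *
  FROM versionA.user_fast_index
 WHERE versionA.bitmap_to_bit_array(hobby_bitmap)
       && versionA.bitmap_to_bit_array(5);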

Actually it would be much, much better if it were possible to build
GIN indexes directly on the bitmap fields themselves, but this is to
be implemented by the GIN and GiST index development team. It would
probably not be a bad idea to file a feature request with them.


With best regards,

Valentine Gogichashvili

On Sep 13, 2:30 pm, [EMAIL PROTECTED] (W.Alphonse HAROUNY) wrote:
 Hello,

 My question is about index usage when bitwise operations are invoked.
 Situation Context:
 --

 Let's suppose we have 2 tables TBL1 and TBL2 as the following:
 TBL1 {
   . ;
   integer categoryGroup; // categoryGroup is declared as an index on TBL1
   . ;

 }

 TBL2 {
   . ;
   integer categoryGroup; // categoryGroup is declared as an index on TBL2
   . ;

 }

 By conception, I suppose that:
 - [categoryGroup] may hold a limited number of values, less than 32 values.
 - [categoryGroup] is of type integer => it means 4 bytes => 32 bits
   => 32 places available to hold binary '0' or binary '1' values.
 - [categoryGroup] is the result of an OR bitwise operation among a
 predefined set of variables [variableCategory].
   We suppose that [variableCategory] is of type integer (= 32 bits)
   and each binary value of [variableCategory] may only hold a single
 binary '1'.

 Ex: variableCategory1 = 00000010
     variableCategory2 = 00001000
     variableCategory3 = 00100000

  If [categoryGroup] = variableCategory1 | variableCategory2 |
 variableCategory3
 => [categoryGroup] = 00101010

 Question:
 --
 I have an SQL request similar to:

 SELECT ... FROM TBL1, TBL2 WHERE
  inner join between TBL1 and TBL2 is True AND
  TBL1.CATEGORY & TBL2.CATEGORY <> 0  //-- where & is the bitwise AND
 operator

 Qst:
 1/ Will the above SQL request use the INDEX [categoryGroup] defined on
 TBL1 and TBL2?
 2/ What should I do, or how should I modify my SQL request, in order
 to force the query engine to use an index? (the already defined index
 or another useful index)

 Thx a lot



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PERFORM] Index usage when bitwise operator is used

2007-09-17 Thread valgog

 What about saying?:

   TBL1.CATEGORY = TBL2.CATEGORY


Are you sure you understood what the question was?

Is TBL1.CATEGORY = TBL2.CATEGORY the same as TBL1.CATEGORY &
TBL2.CATEGORY <> 0?


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] select count(*) performance

2007-08-13 Thread valgog
On Aug 11, 5:54 pm, Detlef Rudolph [EMAIL PROTECTED] wrote:
 Hello Group,

 I've tried the VACUUM ANALYSE, that doesn't help
 much, but VACUUM FULL improves Performance down
 from about 40 secs to 8. I think in future I would
 use the reltuples value from pg_class for the table.

 Thanks a lot for your answers and a good Sunday,
 Det

just do not forget that reltuples is counted and updated in pg_class
only during the vacuuming or analyzing of a table... so the value is
only an approximation
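
For example, a sketch of reading that approximation (the table name is
a placeholder):

SELECT reltuples::bigint AS approximate_row_count
  FROM pg_class
 WHERE relname = 'your_table';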

-- Valentine


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] multicolumn index column order

2007-07-25 Thread valgog
On Jul 25, 2:14 am, Lew [EMAIL PROTECTED] wrote:

 How about two indexes, one on each column?  Then the indexes will cooperate
 when combined in a WHERE clause.
 http://www.postgresql.org/docs/8.2/interactive/indexes-bitmap-scans.html

 I don't believe the index makes a semantic difference with regard to ascending
 or descending.  An index is used to locate records in the selection phase of a
 query or modification command.

 --
 Lew

Ordered indexes (b-tree in this case) are also used to produce the
needed record order, and it is absolutely not necessary to have a
WHERE clause in your select statement for them to be used with ORDER BY.
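
A minimal sketch with a hypothetical table:

CREATE TABLE t (a integer, b integer);
CREATE INDEX idx_t_a_b ON t (a, b);

-- no WHERE clause at all, yet the b-tree can return the rows
-- already in the requested order and avoid a sort:
SELECT * FROM t ORDER BY a, b LIMIT 10;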

--
Valentine


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] disable archiving

2007-07-24 Thread valgog
On Jul 23, 7:24 pm, [EMAIL PROTECTED] (Paul van den Bogaard)
wrote:
 the manual somewhere states ... if archiving is enabled... To me
 this implies that archiving can be disabled. However I cannot find
 the parameter to use to get this result. Or should I enable archiving
 and use a backup script like

 #!/usr/bin/bash
 exit 0

 Would appreciate a hint. And yes I know I put my database in danger
 etc. This is for some benchmarks where I do not want the overhead of
 archiving. Just a file system that will not fill up with zillions of
 these 16MB WAL files ;^)

 Thanks
 Paul.

 ---(end of broadcast)---
 TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly

Is it normal to spoil other threads? Or is it a bug?

If it is not a bug, please change the subject of the topic back to
what it was!

With best regards,

Valentine Gogichashvili


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


[PERFORM] multicolumn index column order

2007-07-23 Thread valgog
Hello all,

how to build a multicolumn index with one column order ASCENDING and
another column order DESCENDING?

The use case that I have is that I use a 2-column index where the
first column is a kind of flag and the second column is the actual
ordering column. The flag should always be ordered DESCENDING, but the
second column is ordered DESCENDING when it is a numeric column, and
ASCENDING when it is a text column.

CREATE TABLE storage (id int, flag int, numeric_data int, text_data text);

SELECT * FROM storage
 ORDER BY flag DESC, numeric_data DESC
LIMIT 20 OFFSET 0;

SELECT * FROM storage
 ORDER BY flag DESC, text_data ASC
LIMIT 20 OFFSET 0;

Definitely the multicolumn index on (flag, numeric_data) is being
used.

But how do I create an index on (flag, text_data DESC)?

I will try an index on ((-flag), text_data) and sorting by (-flag)
ASC, but to tell the truth, it does not really look like a nice
solution.
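
A sketch of that workaround, reusing the storage table from above:

CREATE INDEX idx_storage_negflag_text ON storage ( (-flag), text_data );

SELECT * FROM storage
 ORDER BY (-flag) ASC, text_data ASC
LIMIT 20 OFFSET 0;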


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] multicolumn index column order

2007-07-23 Thread valgog
On Jul 23, 7:00 pm, [EMAIL PROTECTED] (Tom Lane) wrote:
 valgog [EMAIL PROTECTED] writes:
  how to build a multicolumn index with one column order ASCENDING and
  another column order DESCENDING?

 Use 8.3 ;-)

 In existing releases you could fake it with a custom reverse-sorting
 operator class, but it's a pain in the neck to create one.

 regards, tom lane

 ---(end of broadcast)---
 TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

ok, thanks for the rapid answer; I can live with the ((-flag),
text_data) functional multicolumn index for now.

Waiting for 8.3 :-)


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[PERFORM] [PERFORMANCE] is it possible to force an index to be held in memory?

2007-07-02 Thread valgog
Hi,

I have found some discussions about that issue, but did not actually
find the answer.

Is there a way to be sure that some indexes are always in memory? My
tests bringing them to a memory-based file system (ramfs) tablespace
showed a really very significant performance gain. But the prospect of
having a corrupted database after each machine restart does not really
make me feel good.

With best regards,

-- Valentine Gogichashvili


---(end of broadcast)---
TIP 6: explain analyze is your friend


[PERFORM] Key/Value reference table generation: INSERT/UPDATE performance

2007-05-22 Thread valgog
I found several posts about INSERT/UPDATE performance in this group,
but actually none of them was really what I am searching an answer for...

I have a simple reference table WORD_COUNTS that contains the count of
words that appear in a word-array storage in another table.

CREATE TABLE WORD_COUNTS
(
  word text NOT NULL,
  count integer,
  CONSTRAINT PK_WORD_COUNTS PRIMARY KEY (word)
)
WITHOUT OIDS;

I have some PL/pgSQL code in a stored procedure like

  FOR r IN select id, array_of_words
             from word_storage
  LOOP
    begin
      -- insert the missing words
      insert into WORD_COUNTS
             ( word, count )
             ( select d_word, 0
                 from ( select distinct (r.array_of_words)[s.index] as d_word
                          from generate_series(1, array_upper( r.array_of_words, 1 ) ) as s(index)
                      ) as distinct_words
                where d_word not in ( select word from WORD_COUNTS ) );
      -- update the counts
      update WORD_COUNTS
         set count = COALESCE( count, 0 ) + 1
       where word in ( select distinct (r.array_of_words)[s.index] as word
                         from generate_series(1, array_upper( r.array_of_words, 1 ) ) as s(index) );
    exception when others then
      error_count := error_count + 1;
    end;
    record_count := record_count + 1;
  END LOOP;

This code runs extremely slowly. It takes about 10 minutes to process
10 000 records, and the word storage has more than 2 million records
to be processed.

Does anybody have know-how about populating such reference tables,
and what can be optimized in this situation?

Maybe the generate_series() call used to unnest the array is the place
where I lose the performance?

Are set-based updates/inserts more efficient than single
inserts/updates run in smaller loops?

Thanks for your help,

Valentine Gogichashvili


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Key/Value reference table generation: INSERT/UPDATE performance

2007-05-22 Thread valgog
I have rewritten the code like

  existing_words_array := ARRAY( select word
                                   from WORD_COUNTS
                                  where word = ANY ( array_of_words ) );
  not_existing_words_array :=
      ARRAY( select distinct_word
               from ( select distinct (array_of_words)[s.index] as distinct_word
                        from generate_series(1, array_upper( array_of_words, 1 ) ) as s(index)
                    ) as distinct_words
              where distinct_word <> ALL ( existing_words_array ) );
  -- insert the missing words
  if not_existing_words_array is not null then
    insert into WORD_COUNTS
           ( word, count )
           ( select word, 1
               from ( select not_existing_words_array[s.index] as word
                        from generate_series( 1, array_upper( not_existing_words_array, 1 ) ) as s(index)
                    ) as distinct_words
           );
  end if;
  -- update the counts
  if existing_words_array is not null then
    update WORD_COUNTS
       set count = COALESCE( count, 0 ) + 1
     where word = ANY ( existing_words_array );
  end if;


Now it processes a million records in 14 seconds... so the problem was
probably that the NOT IN WORD_COUNTS lookup was way too expensive


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PERFORM] Key/Value reference table generation: INSERT/UPDATE performance

2007-05-22 Thread valgog
On May 22, 12:14 pm, [EMAIL PROTECTED] (PFC) wrote:
 On Tue, 22 May 2007 10:23:03 +0200, valgog [EMAIL PROTECTED] wrote:
  I found several post about INSERT/UPDATE performance in this group,
  but actually it was not really what I am searching an answer for...

  I have a simple reference table WORD_COUNTS that contains the count of
  words that appear in a word array storage in another table.

 Mmm.

 If I were you, I would :

 - Create a procedure that flattens all the arrays and returns all the
 words :

 PROCEDURE flatten_arrays RETURNS SETOF TEXT
 FOR word_array IN SELECT word_array FROM your_table LOOP
 FOR i IN 1...array_upper( word_array ) LOOP
 RETURN NEXT tolower( word_array[ i ] )

 So, SELECT * FROM flatten_arrays() returns all the words in all the arrays.
 To get the counts quickly I'd do this :

 SELECT word, count(*) FROM flatten_arrays() AS word GROUP BY word

 You can then populate your counts table very easily and quickly, since  
 it's just a seq scan and hash aggregate. One second for 10.000 rows would  
 be slow.

 ---(end of broadcast)---
 TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

good idea indeed! will try this approach.
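
For reference, a runnable PL/pgSQL version of PFC's sketch (table and
column names taken from the thread; lower() stands in for the
pseudocode's tolower()):

CREATE OR REPLACE FUNCTION flatten_arrays() RETURNS SETOF TEXT AS
$$
DECLARE
    words text[];
    i integer;
BEGIN
    FOR words IN SELECT array_of_words FROM word_storage LOOP
        FOR i IN 1 .. COALESCE( array_upper( words, 1 ), 0 ) LOOP
            RETURN NEXT lower( words[i] );
        END LOOP;
    END LOOP;
    RETURN;
END;
$$
LANGUAGE plpgsql;

-- one seq scan plus a hash aggregate:
SELECT word, count(*) FROM flatten_arrays() AS word GROUP BY word;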


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Key/Value reference table generation: INSERT/UPDATE performance

2007-05-22 Thread valgog
On May 22, 12:00 pm, valgog [EMAIL PROTECTED] wrote:
 I have rewritten the code like

   existing_words_array := ARRAY( select word
                                    from WORD_COUNTS
                                   where word = ANY ( array_of_words ) );
   not_existing_words_array :=
       ARRAY( select distinct_word
                from ( select distinct (array_of_words)[s.index] as distinct_word
                         from generate_series(1, array_upper( array_of_words, 1 ) ) as s(index)
                     ) as distinct_words
               where distinct_word <> ALL ( existing_words_array ) );
   -- insert the missing words
   if not_existing_words_array is not null then
     insert into WORD_COUNTS
            ( word, count )
            ( select word, 1
                from ( select not_existing_words_array[s.index] as word
                         from generate_series( 1, array_upper( not_existing_words_array, 1 ) ) as s(index)
                     ) as distinct_words
            );
   end if;
   -- update the counts
   if existing_words_array is not null then
     update WORD_COUNTS
        set count = COALESCE( count, 0 ) + 1
      where word = ANY ( existing_words_array );
   end if;

 Now it processes a million records in 14 seconds... so the problem was
 probably that the NOT IN WORD_COUNTS lookup was way too expensive

Sorry... this code did not update anything at all, as I forgot about
the NULL values... I had to COALESCE practically everything and use
array_upper()... I do not have the performance numbers for the
inserts and updates yet...


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly