[PERFORM] tuning Postgres for large data import (using Copy from)

2005-05-12 Thread Marc Mamin
Hello,


I'd like to tune Postgres for large data imports (using COPY FROM).


Here are a few steps already done:



1) use 3 different disks for:

-1: source data
-2: index tablespaces
-3: data tablespaces


2) define all foreign keys as initially deferred (a minimal sketch follows after this list)


3) tune some parameters:



max_connections =20
shared_buffers =3
work_mem = 8192  
maintenance_work_mem = 32768
checkpoint_segments = 12

(I also modified the kernel accordingly)




4) run VACUUM regularly
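
A minimal sketch for step 2 (table, column and file names are only illustrative):

   create table dim  (id int primary key);
   create table fact (
       id     int primary key,
       dim_id int references dim (id) deferrable initially deferred
   );

   begin;
   copy fact from '/data/import/fact.dat';  -- FK checks are deferred until the commit
   commit;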


The server runs RedHat and has 1GB RAM

In production (which may run on a better server), I plan to:

- import a few million rows per day,
- keep up to ca. 100 million rows in the db,
- delete older data.




I've seen a few postings on hash/btree indexes which say that hash indexes do
not work very well in Postgres;
currently, I only use btree indexes. Could I gain performance by using
hash indexes as well?

How does Postgres handle concurrent COPY FROM on the same table / on different
tables?


I'd be glad of any further suggestions on how to increase my import
performance.




Marc






Re: [PERFORM] Optimising queries involving unions

2005-05-27 Thread Marc Mamin
Hi,

I'm using a workaround for this kind of issue:


consider:

select A from
   (select B from T1 where C
    union
    select B from T2 where C
    union
    select B from T3 where C
   ) foo
where D


in your case:

SELECT u.txt
FROM (
      SELECT id, txt FROM largetable1, smalltable t WHERE t.id = u.id AND t.foo = 'bar'
      UNION ALL
      SELECT id, txt FROM largetable2, smalltable t WHERE t.id = u.id AND t.foo = 'bar'
     ) u



and

select A from foo where C and D

(A, B, C, D being everything you want; C and D may also include GROUP BY,
ORDER BY, ...)

The first version will be handled correctly by the optimiser, whereas in the
second version
Postgres will first build the UNION and then run the query on it.




I have large tables with identical structure, one per day.
Instead of defining a view on all tables,
I'm using functions that distribute my query on all tables.

The only issue is that I need to define a type that matches the result
structure, and each return type needs its own function.


Example:
(The first parameter is a schema name; the four next correspond to A, B, C, D.)





-
create type T_i2_vc1 as (int_1 int,int_2 int,vc_1 varchar);

CREATE OR REPLACE FUNCTION vq_T_i2_vc1(varchar, varchar, varchar, varchar, varchar)
RETURNS setof T_i2_vc1 AS $$
DECLARE
    result  T_i2_vc1%rowtype;
    mviews  RECORD;
    sql     varchar;
    counter int;
BEGIN
    select into counter 1;

    -- loop on all daily tables
    FOR mviews IN SELECT distinct this_day FROM daylist ORDER BY plainday desc LOOP

        IF counter = 1 THEN
            select INTO sql 'SELECT '||mviews.this_day||' AS plainday, '||$2
                          ||' FROM '||$3||'_'||mviews.plainday||' WHERE '||$4;
        ELSE
            select INTO sql sql||' UNION ALL SELECT '||mviews.this_day||' AS plainday, '||$2
                          ||' FROM '||$3||'_'||mviews.plainday||' WHERE '||$4;
        END IF;

        select into counter counter + 1;
    END LOOP;

    select INTO sql 'SELECT '||$1||' FROM ('||sql||') foo '||$5;

    FOR result IN EXECUTE (sql) LOOP
        RETURN NEXT result;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;
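
A hypothetical call, with the parameters as they are used in the function body
(all names below are made up for illustration):

SELECT *
FROM vq_T_i2_vc1(
    'int_1, int_2, vc_1',                             -- outer select list (A)
    'col_a AS int_1, col_b AS int_2, col_c AS vc_1',  -- select list for each daily table (B)
    'mylog',                                          -- table prefix: daily tables mylog_<plainday>
    'col_a > 0',                                      -- where clause for each daily table (C)
    'ORDER BY 1'                                      -- trailing clause on the union (D)
);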



Note: in your case the function should have a further parameter to join
largetable(n) to smalltable in the subqueries.

HTH,

Marc





 I've got a query that I think the query optimiser should be able
 to work its magic on but it doesn't!  I've had a look around and
 asked on the IRC channel and found that the current code doesn't
 attempt to optimise for what I'm asking it to do at the moment.
 Here's a bad example:
 
   SELECT u.txt
   FROM smalltable t, (
 SELECT id, txt FROM largetable1
 UNION ALL
 SELECT id, txt FROM largetable2) u
   WHERE t.id = u.id
 AND t.foo = 'bar';
 
 I was hoping that smalltable would get moved up into the union,
 but it doesn't at the moment and the database does a LOT of extra
 work.  In this case, I can manually do quite a couple of transforms
 to move things around and it does the right thing:
 
   SELECT txt
   FROM (
     SELECT l.id as lid, r.id as rid, r.foo, l.txt
       FROM largetable1 l, smalltable r
     UNION ALL
     SELECT l.id as lid, r.id as rid, r.foo, l.txt
       FROM largetable2 l, smalltable r)
   WHERE foo = 'bar'
     AND lid = rid;
 
 The optimiser is intelligent enough to move the where clauses up
 into the union and end up with a reasonably optimal query.
 Unfortunately, in real life, the query is much larger and reorganising
 everything manually isn't really feasible!



[PERFORM] TIP 9: the planner will ignore... datatypes

2005-06-01 Thread Marc Mamin
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


But INT2, INT4, INT8 and SERIAL are considered to be a single (matching) datatype.
Am I right?

Thanks,

Marc



[PERFORM] Query limitations (size, number of UNIONs ...)

2005-06-03 Thread Marc Mamin

Hello,


I've split my data in daily tables to keep them in an acceptable size.

Now I have quite complex queries which may be very long if I need to query a
large number of daily tables.


I've just made a first test which resulted in a query 15 KB big and
containing 63 UNIONs.

The query plan in pgAdmin is about 100 KB big, with 800 lines :-)


The performance is not that bad, but I'm wondering if there are some
Postgres limitations I should take care of with this strategy.


Thanks,

Marc



Re: [PERFORM] Looking for tips

2005-07-23 Thread Marc Mamin
Hi,

I have a similar application, but instead of adding new items to the db one
at a time, I retrieve new IDs from a sequence (actually only every 10'000
times) and write a csv file from perl.
When finished, I load all new records in one run with COPY.
hth,

Marc Mamin



From: [EMAIL PROTECTED] on behalf of Oliver Crosby
Sent: Wed 7/20/2005 3:50 AM
To: PFC
Cc: Sven Willenberger; Dawid Kuroczko; Kevin Grittner; [EMAIL PROTECTED]; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Looking for tips

Sorry for the lack of specifics...

We have a file generated as a list of events, one per line. Suppose lines
1,2,3,5,7,11,etc were related, then the last one would specify that it's the
last event. Gradually this gets assembled by a perl script and when the last
event is encountered, it gets inserted into the db. For a given table, let's
say it's of the form (a,b,c) where 'a' is a pkey, 'b' is indexed, and 'c' is
other related information. The most common 'b' values are cached locally with
the perl script to save us having to query the db. So what we end up having is:

if 'b' exists in cache, use cached 'a' value and continue
else if 'b' exists in the db, use the associated 'a' value and continue
else add a new line with 'b', return the new 'a' and continue

The local cache was a huge time saver with mysql. I've tried making a plpgsql
function that handles everything in one step on the db side, but it didn't
show any improvement. Time permitting, I'll try some new approaches with
changing the scripts and queries, though right now I was just hoping to tune
postgresql.conf to work better with the hardware available.

Thanks to everyone for your help. Very much appreciated.




Re: [PERFORM] Scaling SELECT:s with the number of disks on a stripe

2007-03-30 Thread Marc Mamin
Hello Peter,

If you are dealing with timed data or similar, you may consider partitioning
your table(s).

In order to deal with large data, I've built a logical partition system,
where the target partition is defined by the date of my data (the date
is part of the filenames that I import...).

Instead of using the Postgres partitioning framework, I keep the table
boundaries within a reference table.
Then I've built a function that takes the different query parameters as
arguments (column list, where clause...).
This function retrieves the list of tables to query from my reference
table and builds the final query, binding
the different subqueries from each partition with UNION ALL.
It also requires an additional reference table that describes the table
columns (data type, behaviour, e.g. groupable, summable...).


This allowed me to replace many DELETE statements with DROP TABLE statements,
which is probably the main advantage of the solution.
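
A minimal sketch of such a reference table (all names are illustrative;
:from_date / :to_date stand for the requested period):

   create table partition_ref (
       table_name varchar not null,    -- e.g. 'data_20070330'
       min_date   date    not null,
       max_date   date    not null
   );

   -- the builder function first picks the partitions overlapping the requested period ...
   select table_name
   from   partition_ref
   where  min_date <= :to_date
   and    max_date >= :from_date;

   -- ... and then glues one sub-select per returned table together with UNION ALL.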


The biggest issue was the implementation time ;-) but I'm really happy
with the resulting performance.

HTH,

Marc



 

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Peter
Schuller
Sent: Friday, March 30, 2007 7:17 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Scaling SELECT:s with the number of disks on a stripe

Hello,

I am looking to use PostgreSQL for storing some very simple flat data
mostly in a single table. The amount of data will be in the hundreds of
gigabytes range. Each row is on the order of 100-300 bytes in size; in
other words, small enough that I am expecting disk I/O to be seek bound
(even if PostgreSQL reads a full pg page at a time, since a page is
significantly smaller than the stripe size of the volume).

The only important performance characteristics are insertion/deletion
performance, and the performance of trivial SELECT queries whose WHERE
clause tests equality on one of the columns.

Other than absolute performance, an important goal is to be able to
scale fairly linearly with the number of underlying disk drives. We are
fully willing to take a disk seek per item selected, as long as it
scales.

To this end I have been doing some benchmarking to see whether the plan
is going to be feasable. On a 12 disk hardware stripe, insertion
performance does scale somewhat with concurrent inserters. However, I am
seeing surprising effects with SELECT:s: a single selecter generates the
same amount of disk activity as two concurrent selecters (I was easily
expecting about twice).

The query is simple:

SELECT * FROM test WHERE value = 'xxx' LIMIT 1000;

No ordering, no joins, no nothing. Selecting concurrently with two
different values of 'xxx' yields the same amount of disk activity (never
any significant CPU activity). Note that the total amount of data is too
large to fit in RAM (> 500 million rows), and the number of distinct
values in the value column is 1. The column in the WHERE clause is
indexed.

So my first question is - why am I not seeing this scaling? The absolute
amount of disk activity with a single selecter is consistent with what I
would expect from a SINGLE disk, which is completely expected since I
never thought PostgreSQL would introduce disk I/O concurrency on its
own. But this means that adding additional readers doing random-access
reads *should* scale very well with 12 underlying disks in a stripe.

(Note that I have seen fairly similar results on other RAID variants
too, including software RAID5 (yes yes I know), in addition to the
hardware stripe.)

These tests have been done Linux 2.6.19.3 and PostgreSQL 8.1.

Secondly, I am seeing a query plan switch after a certain threshold.
Observe:

perftest=# explain select * from test where val='7433' limit 1000;
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Limit  (cost=0.00..4016.50 rows=1000 width=143)
   ->  Index Scan using test_val_ix on test  (cost=0.00..206620.88 rows=51443 width=143)
         Index Cond: ((val)::text = '7433'::text)
(3 rows)

Now increasing to a limit of 1:

perftest=# explain select * from test where val='7433' limit 1;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Limit  (cost=360.05..38393.36 rows=1 width=143)
   ->  Bitmap Heap Scan on test  (cost=360.05..196014.82 rows=51443 width=143)
         Recheck Cond: ((val)::text = '7433'::text)
         ->  Bitmap Index Scan on test_val_ix  (cost=0.00..360.05 rows=51443 width=0)
               Index Cond: ((val)::text = '7433'::text)
(5 rows)

The interesting part is that the latter query is entirely CPU bound (no
disk I/O at all) for an extended period of time before even beginning to
read data from disk. And when it *does* start performing disk I/O, the
performance is about the same as for the other case. In other words, the
change in query plan seems to do 

Re: [PERFORM] Planner doing seqscan before indexed join

2007-03-30 Thread Marc Mamin
You may try to change the planner's opinion using sub queries. Something
like:


select * from 
   
   eventactivity,   

   (select * from 
  keyword_incidents, 
  eventmain, 
  eventgeo 
where 
  eventmain.incidentid = keyword_incidents.incidentid 
  and eventgeo.incidentid = keyword_incidents.incidentid 
  and (  recordtext like '%JOSE CHAVEZ%'   )
   )foo
   
 where eventactivity.incidentid = foo.incidentid 
 order by foo.entrydate limit 1;


HTH,

Marc

 

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Dan Harris
Sent: Thursday, March 29, 2007 4:22 AM
To: PostgreSQL Performance
Subject: [PERFORM] Planner doing seqscan before indexed join

8.0.3 - Linux 2.6.18..
Freshly vacuumed and analyzed

This database has been humming along fine for a while now, but I've got
one of those sticky queries that is taking much too long to finish.

After some digging, I've found that the planner is choosing to apply a
necessary seq scan to the table.  Unfortunately, it's scanning the whole
table, when it seems that it could have joined it to a smaller table
first and reduced the amount of rows it would have to scan dramatically
(from 70 million to about 5,000).

The table eventactivity has about 70million rows in it, index on
incidentid
The table keyword_incidents is a temporary table and has incidentid as
its primary key.  It contains
5125 rows. eventmain and eventgeo both have 2.1 million. My hope is that
I can convince the planner to do the
  join to keyword_incidents *first* and then do the seq scan for the
LIKE condition.  Instead, it seems that it's seqscanning the whole 70
million rows first and then doing the join, which takes a lot longer
than I'd like to wait for it.  Or, maybe I'm misreading the explain
output?

Thanks again

-Dan
-
Here's the query:

explain analyze 

select 
   * from 

   keyword_incidents, 

   eventactivity, 

   eventmain, 

   eventgeo 

  where 

   eventmain.incidentid = keyword_incidents.incidentid and 

   eventgeo.incidentid = keyword_incidents.incidentid and 

   eventactivity.incidentid = keyword_incidents.incidentid 

   and (  recordtext like '%JOSE CHAVEZ%'   )
order by eventmain.entrydate limit 1;



--------------------------------------------------------------------------------------------------
 Limit  (cost=2388918.07..2388918.08 rows=1 width=455) (actual time=81771.186..81771.292 rows=26 loops=1)
   ->  Sort  (cost=2388918.07..2388918.08 rows=1 width=455) (actual time=81771.180..81771.215 rows=26 loops=1)
         Sort Key: eventmain.entrydate
         ->  Nested Loop  (cost=0.00..2388918.06 rows=1 width=455) (actual time=357.389..81770.982 rows=26 loops=1)
               ->  Nested Loop  (cost=0.00..2388913.27 rows=1 width=230) (actual time=357.292..81767.385 rows=26 loops=1)
                     ->  Nested Loop  (cost=0.00..2388909.33 rows=1 width=122) (actual time=357.226..81764.501 rows=26 loops=1)
                           ->  Seq Scan on eventactivity  (cost=0.00..2388874.46 rows=7 width=84) (actual time=357.147..81762.582 rows=27 loops=1)
                                 Filter: ((recordtext)::text ~~ '%JOSE CHAVEZ%'::text)
                           ->  Index Scan using keyword_incidentid_pkey on keyword_incidents  (cost=0.00..4.97 rows=1 width=38) (actual time=0.034..0.036 rows=1 loops=27)
                                 Index Cond: (("outer".incidentid)::text = (keyword_incidents.incidentid)::text)
                     ->  Index Scan using eventgeo_incidentid_idx on eventgeo  (cost=0.00..3.93 rows=1 width=108) (actual time=0.076..0.081 rows=1 loops=26)
                           Index Cond: (("outer".incidentid)::text = (eventgeo.incidentid)::text)
               ->  Index Scan using eventmain_incidentid_idx on eventmain  (cost=0.00..4.78 rows=1 width=225) (actual time=0.069..0.075 rows=1 loops=26)
                     Index Cond: (("outer".incidentid)::text = (eventmain.incidentid)::text)
 Total runtime: 81771.529 ms
(15 rows)



[PERFORM] copy from performance on large tables with indexes

2007-06-07 Thread Marc Mamin

Hello,


Postgres: 8.2
os: Linux 4CPU, 4 GB RAM, Raid 1, 32 bit system
work_mem: 600 Mb


I have some tables which may become quite large (currently up to 6 GB).
I initially fill them using COPY FROM (files).

The import is fast enough as I only have a primary key on the table:
about 18 minutes
(over 300 MB/minute)

Then I need 5 additional indexes on it. Creation time: 30 minutes


Subsequently I compute some aggregations, which need an additional 4 hours
and 30 minutes.


And now the problem:

If I get additional data for the table, the import becomes much slower
due to the indexes (about 30 times slower!):

The performance degradation is probably due to the fact that all the
indexes are too large to be kept in memory.
Moreover I guess that the index fill factors are too high (90%).

During this second import, I have about 20% iowait time.



The usual solution is to drop the indexes before the second import and
rebuild them afterwards, but I don't feel confident doing this as I don't
know how the system will react if some SELECT statements occur while
the indexes are missing. I can hardly avoid this.


So my idea for the second import process:


1) make a copy of the table:

   create table B as select * from A;
   alter table B add constraint B_pk primary key (id);


2) import the new data in table B:

   copy B from file;

3) create the required indexes on B:

   create index Bix_1 on B ..
   create index Bix_2 on B ..
   create index Bix_3 on B ..
   create index Bix_4 on B ..

4) replace table A with table B:

   alter table A rename to A_trash;
   alter table B rename to A;
   drop table A_trash;

   (and rename the indexes to get the original state)
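
To keep the window small in which sessions see neither table, step 4 can be
wrapped in a single transaction (a sketch; the renames take exclusive locks
and will wait for running statements):

   begin;
   alter table A rename to A_trash;
   alter table B rename to A;
   -- rename the indexes here as well
   commit;

   drop table A_trash;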
 
 
 
 
 
This seems to work, but with side effects:

The only objects that refer to the tables are functions and indexes.

If a function is called within the same session before and after the table
renaming, the second attempt fails (or uses the table A_trash if it still
exists). So I should close the session and start a new one before
further processing. Errors in other live sessions are acceptable, but
maybe you know a way to avoid them?



And now a few questions :-)

- do you see any issue that prevents this workflow from working?

- is there any other side effect to take care of?

- what is the maximum acceptable value for the parameter work_mem for my
  configuration (see the complete configuration below)?

- has anybody built a similar workflow?

- could this be a feature request to extend the capabilities of COPY FROM?



Thanks for your time and attention,

Marc Mamin

 


[PERFORM] tuning a function to insert/retrieve values from a reference table

2007-07-10 Thread Marc Mamin

Hello,

I have a simple table id/value, and a function that returns the id of a
given value, inserting the latter if not yet present. The probability
that a value already exists within the reference table is very high.

Different db users may have their own reference table with different
content, but as the table definition is identical, I've defined a public
function to maintain these tables.

Can I optimize this function by:

a) removing the EXCEPTION clause (is there an underlying lock that prevents
concurrent inserts?)

b) declaring the function IMMUTABLE?

   - although it may insert a new row, the returned id is invariant for
     a given user
     (I don't really understand the holdability of immutable functions;
     are the results cached only for the lifetime of a prepared statement,
     or can they be shared by different sessions?)


Thanks,

Marc




--Table definition:

create table ref_table (
  id serial NOT NULL, 
  v varchar NOT NULL, 
  constraint ref_table_pk primary key  (id)
) without oids;

create unique index ref_table_uk on ref_table(v);


-- Function:

CREATE OR REPLACE FUNCTION public.get_or_insert_value(varchar) RETURNS
INT AS 
$BODY$

DECLARE
  id_value INT;

BEGIN

  SELECT INTO id_value id FROM ref_table WHERE v =  $1;

  IF FOUND THEN

RETURN id_value;

  ELSE  --new value to be inserted

DECLARE
  rec record;

BEGIN

 FOR rec in INSERT INTO ref_table (v) VALUES ($1) RETURNING id
 LOOP
  return rec.id;  
 END LOOP;

 EXCEPTION --concurrent access ?
   WHEN unique_violation THEN
 RETURN(SELECT id FROM ref_table WHERE v =  $1);

END;

  END IF;
END;
$BODY$
  LANGUAGE 'plpgsql' VOLATILE;
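

A hypothetical call (the value is only an example):

  SELECT public.get_or_insert_value('some value');
  -- returns the existing id, or the id of the freshly inserted row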



[PERFORM] Postgres configuration for 64 CPUs, 128 GB RAM...

2007-07-17 Thread Marc Mamin

Postgres configuration for 64 CPUs, 128 GB RAM...

Hello,

We have the opportunity to benchmark our application on a large server. I
have to prepare the Postgres configuration and I'd appreciate some
comments on it as I am not experienced with servers of such a scale.
Moreover the configuration should be fail-proof as I won't be able to
attend the tests.

Our application (java + perl) and Postgres will run on the same server,
though the application activity is low when Postgres has large
transactions to process.

There is a large gap between our current production server (Linux, 4GB
RAM, 4 cpus) and the benchmark server; one of the targets of this
benchmark is to verify the scalability of our application.


And you have no reason to be envious, as the server doesn't belong to us :-)


Thanks for your comments,

Marc Mamin





Postgres version: 8.2.1



Server Specifications:
--

Sun SPARC Enterprise M8000 Server:

http://www.sun.com/servers/highend/m8000/specs.xml

File system:

http://en.wikipedia.org/wiki/ZFS



Planned configuration:


# we don't expect more than 150 parallel connections,
# but I suspect a leak in our application that lets some idle connections
# open

max_connections=2000

ssl = off 

#maximum allowed
shared_buffers= 262143

# on our current best production server with 4GB RAM (not dedicated to
# Postgres), work_mem is set to 600 MB
# this limitation is probably the bottleneck for our application as the
# files in pgsql_tmp grow up to 15 GB
# during large aggregations (we have a locking mechanism to avoid
# parallel processing of such transactions)
work_mem = 31457280  # (30 GB)

# index creation time is also an issue for us; the process is locking
# other large processes too.
# our largest table so far is 13 GB + 11 GB indexes
maintenance_work_mem = 31457280  # (30 GB)

# more than the max number of tables + indexes expected during the
# benchmark
max_fsm_relations = 10

max_fsm_pages = 180

# don't know if I should modify this.
# seems to be sufficient on our production servers
max_stack_depth = 2MB

# vacuum will be done per hand between each test session
autovacuum = off 



# required to analyse the benchmark
log_min_duration_statement = 1000


max_prepared_transactions = 100


# seems to be required to drop schemas/roles containing a large number of
# objects
max_locks_per_transaction = 128 




# I use the defaults for the bgwriter as I couldn't find recommendations on
# those

#bgwriter_delay = 200ms         # 10-10000ms between rounds
#bgwriter_lru_percent = 1.0     # 0-100% of LRU buffers scanned/round
#bgwriter_lru_maxpages = 5      # 0-1000 buffers max written/round
#bgwriter_all_percent = 0.333   # 0-100% of all buffers scanned/round
#bgwriter_all_maxpages = 5      # 0-1000 buffers max written/round


#WAL

fsync = on

#use default
#wal_sync_method

# we are using 32 on our production system
wal_buffers=64


# we didn't make any testing with this parameter until now, but this
# shouldn't be a relevant
# point as our performance focus is on large transactions
commit_delay = 0 

#CHECKPOINT

# xlog will be  on a separate disk
checkpoint_segments=256

checkpoint_timeout = 5min


Re: [PERFORM] Postgres configuration for 64 CPUs, 128 GB RAM...

2007-07-24 Thread Marc Mamin
 
Hello,

thank you for all your comments and recommendations.

I'm aware that the conditions for this benchmark are not ideal, mostly
due to the lack of time to prepare it. We will also need an additional
benchmark on a less powerful - more realistic - server to better
understand the scalability of our application.


Our application is based on java and is generating dynamic reports from
log file content. Dynamic means here that a report will be calculated
from the postgres data the first time it is requested (it will then be
cached). Java is used to drive the data preparation and to
handle/generate the report requests.

This is much more an OLAP system than an OLTP one, at least as far as our
performance concerns go.




Data preparation:

1) parsing the log files with a heavy use of perl (regular expressions)
to generate csv files. Prepared statements also maintain reference
tables in the DB. Postgres performance is not an issue for this first
step.

2) loading the csv files with COPY. As around 70% of the data to load
comes into a single daily table, we don't allow concurrent jobs for this
step. We have between a few and a few hundred files to load into a
single table; they are processed one after the other. A primary key is
always defined; for the case when the required indexes are already built
and when the new data are above a given size, we are using a shadow
table instead (without the indexes), build the indexes after the import
and then replace the live table with the shadow one.
For example, we have a table of 13 GB + 11 GB indexes (5 pieces).

Performances :

        a) is there an ideal size to consider for our csv files
           (100 x 10 MB, or better 1 x 1 GB)?
        b) maintenance_work_mem: I'll use around 1 GB as recommended by Stefan

3) Data aggregation. This is the heaviest part for Postgres. On our
current system some queries need above one hour, with phases of around
100% cpu use, alternating with times of heavy i/o load when temporary
results are written/read to disk (pgsql_tmp). During the
aggregation, other postgres activities are low (at least they should be) as
this should take place at night. Currently we have a locking mechanism
to avoid having more than one of such queries running concurrently. This
may be too strict for the benchmark server but better reflects our current
hardware capabilities.

Performance: here we should favour a single huge transaction and
consider a low probability of having another transaction requiring large
sort space. Considering this, is it reasonable to set work_mem to 3 GB?
(I guess I should raise this parameter dynamically before running
the aggregation queries; a sketch follows below.)
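
For example, raising the parameter only in the session that runs the aggregation:

   SET work_mem = '3GB';
   -- ... run the large aggregation queries ...
   RESET work_mem;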

4) Queries (report generation)

We have only a few requests which are not satisfying while requiring large
sort operations. The data are structured in different aggregation levels
(minutes, hours, days) with logical time-based partitions in order to
limit the data size to compute for a given report. Moreover we can scale
our infrastructure by using different or dedicated Postgres servers
for different customers. Smaller customers may share the same instance,
each of them having its own schema (the lock mechanism for large
aggregations applies to a whole Postgres instance, not to a single
customer). The benchmark will help us to plan such a distribution.

During the benchmark, we will probably not have more than 50 non-idle
connections simultaneously. It is a bit too early for us to fine-tune
this part. The benchmark will mainly focus on the steps 1 to 3.

During the benchmark, the DB will reach a size of about 400 GB,
simulating 3 different customers, with the data split roughly equally
across 3 schemas.



I will post our configuration(s) later on.



Thanks again for all your valuable input.

Marc Mamin



Re: [PERFORM] join tables vs. denormalization by trigger

2007-09-04 Thread Marc Mamin
Hello,

I had a similar issue and - after testing - decided to merge the tables
B and C into a single table.
In my case the resulting table contains a large proportion of nulls,
which limits the size increase...
You'll have to do some testing with your data to evaluate the
performance gain.

Hope this helps,

Marc  

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Walter
Mauritz
Sent: Tuesday, September 04, 2007 8:53 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] join tables vs. denormalization by trigger

Hi,

I wonder about differences in performance between two scenarios:

Background:
Table A, ~50,000 records
Table B, ~3,000,000 records (~20 cols)
Table C, ~30,000,000 records (~10 cols)

a query every 3sec. with limit 10

Table C depends on Table B which depends on Table A, int8 foreign key,
btree index

* consider it a read-only scenario (data loaded only at night, with time
for vacuum analyze daily)
* I'm required to show records from Table C, but also with some (~5 cols)
info from Table B
* the where clause always contains the foreign key to Table A
* the where clause may contain a further 1-10 search parameters


Scenario A)
simply inner join Table B + C

Scenario B)
with use of trigger on insert/update I could push the required
information from table B down to table C.
- so i would only require to select from table C.


My question:
1) From your experience ... how much faster (approximately) in percent
do you regard Scenario B faster than A ?

2) any other tips for such a read only scenario

Thx for any attention :-)
Walter


Re: [PERFORM] Delete performance again

2008-10-03 Thread Marc Mamin
Hi,
 
Maybe you can try this syntax. I'm not sure, but it may perform better:


delete from company_alias USING comprm
where company_alias.company_id = comprm.id


Cheers,

Marc



[PERFORM] temp_tablespaces and RAID

2008-12-22 Thread Marc Mamin

Hello,

To improve performance, I would like to try moving the temp_tablespaces
locations outside of our RAID system.
Is this good practice?
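
What I have in mind is something like this (path and tablespace name are
illustrative only):

   CREATE TABLESPACE tmp_ts LOCATION '/mnt/separate_disk/pgtmp';

   -- in postgresql.conf, or per session:
   SET temp_tablespaces = 'tmp_ts';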


Thanks,

Marc Mamin



Re: [PERFORM] Very specialised query

2009-03-27 Thread Marc Mamin
Hello,

if your data are mostly static and you have a few main objects,
maybe you can gain something by defining conditional (partial) indexes for
those, plus one for the rest,
and then slicing the query:


create index o_1x on X (start,end,id) where object_id = 1
create index o_2x on X (start,end,id) where object_id = 2
create index o_3x on X (start,end,id) where object_id = 3
create index o_4x on X (start,end,id) where object_id = 4
...
create index o_x  on X (start,end,id) where object_id not in (1,2,3,4..)


I'm not sure that putting all in one index and using the BETWEEN clause 
as in my example is the best method though.

Marc Mamin


SELECT
    l1.id AS id1,
    l2.id AS id2
FROM
    location l1,
    location l2
WHERE l1.objectid = 1
  AND (   l2.start BETWEEN l1.start AND l1.end
       OR l1.start BETWEEN l2.start AND l2.end
      )
  AND l2.start <> l1.start -- if required
  AND l2.start <> l2.end   -- if required
  AND l1.id    <> l2.id


UNION ALL

...
WHERE l1.objectid = 2
... 

UNION ALL

...
WHERE l1.objectid not in (1,2,3,4..)


Re: [PERFORM] Very specialised query

2009-03-30 Thread Marc Mamin

 WHERE (l2.start BETWEEN  l1.start AND l1.end
  OR
  l1.start BETWEEN  l2.start AND l2.end
  )

Yes, that's another way to calculate an overlap. However, it turns out to not 
be that fast. 
The problem is that OR there, which causes a bitmap index scan, as the leaf of 
a nested loop join, 
which can be rather slow.


OK, then splitting these checks into 2 queries with UNION is better.
But I often read that BETWEEN is faster than using 2 comparison operators.
Here I guess that a combined index on (start,end) makes sense:

..
WHERE l2.start BETWEEN  l1.start AND l1.end
..
UNION
..
WHERE l1.start BETWEEN  l2.start AND l2.end
..


The first clause being equivalent to

AND l1.start <= l2.end
AND l1.end   >= l2.start
AND l1.start <= l2.start

I don't know how you have to deal with the limit conditions...


Marc Mamin



Re: [PERFORM] Very specialised query

2009-03-30 Thread Marc Mamin
Hello Matthew,

Another idea:

Are your objects limited to some smaller ranges of your whole interval?
If yes, you may possibly reduce the ranges to search by using an
additional table with the min(start) / max(end) of each object...

Marc Mamin


Re: [PERFORM] optimizing query with multiple aggregates

2009-10-26 Thread Marc Mamin
Hello,
 
I didn't try it, but the following should be slightly faster:

COUNT(CASE WHEN field >= x AND field < y THEN true END)

instead of

SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
 
HTH,
 
Marc Mamin




From: pgsql-performance-ow...@postgresql.org
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Nikolas
Everett
Sent: Thursday, October 22, 2009 4:48 AM
To: Doug Cole
Cc: pgsql-performance
Subject: Re: [PERFORM] optimizing query with multiple aggregates


So you've got a query like: 

SELECT SUM(CASE WHEN field >= 0 AND field < 10 THEN 1 ELSE 0 END) as zeroToTen,
       SUM(CASE WHEN field >= 10 AND field < 20 THEN 1 ELSE 0 END) as tenToTwenty,
       SUM(CASE WHEN field >= 20 AND field < 30 THEN 1 ELSE 0 END) as tenToTwenty,
...
FROM bigtable




My guess is this is forcing a whole bunch of if checks and you're getting cpu
bound.  Could you try something like:


SELECT SUM(CASE WHEN field >= 0 AND field < 10 THEN count ELSE 0 END) as zeroToTen,
       SUM(CASE WHEN field >= 10 AND field < 20 THEN count ELSE 0 END) as tenToTwenty,
       SUM(CASE WHEN field >= 20 AND field < 30 THEN count ELSE 0 END) as tenToTwenty,
...
FROM (SELECT field, count(*) FROM bigtable GROUP BY field)


which will allow a hash aggregate?  You'd do a hash aggregate on the
whole table which should be quick and then you'd summarize your bins.


This all supposes that you don't want to just query postgres's column
statistics.


On Wed, Oct 21, 2009 at 10:21 PM, Doug Cole dougc...@gmail.com wrote:


On Wed, Oct 21, 2009 at 5:39 PM, Merlin Moncure
mmonc...@gmail.com wrote:

 On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole
dougc...@gmail.com wrote:
  I have a reporting query that is taking nearly all of it's
time in aggregate
  functions and I'm trying to figure out how to optimize it.
The query takes
  approximately 170ms when run with select *, but when run
with all the
  aggregate functions the query takes 18 seconds.  The
slowness comes from our
  attempt to find distribution data using selects of the form:
 
  SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
 
  repeated across many different x,y values and fields to
build out several
  histograms of the data.  The main culprit appears to be the
CASE statement,
  but I'm not sure what to use instead.  I'm sure other people
have had
  similar queries and I was wondering what methods they used
to build out data
  like this?

 have you tried:

 count(*) where field >= x AND field < y;

 ??

 merlin


Unless I'm misunderstanding you, that would require breaking
each bin
into a separate sql statement and since I'm trying to calculate
more
than 100 bins between the different fields any improvement in
the
aggregate functions would be overwhelmed by the cost of the
actual
query, which is about 170ms.
Thanks,
Doug







[PERFORM] CREATE INDEX as bottleneck

2010-11-11 Thread Marc Mamin
Hello,

in the last years, we have successfully managed to cope with our data
growth
using partitioning and splitting large aggregation tasks over multiple
threads.
The partitioning is done logically by our application server, thus
avoiding trigger overhead.

There are a few places in our data flow where we have to wait for index
creation before being able to distribute the process on multiple threads
again.

With the expected growth, CREATE INDEX will probably become a severe
bottleneck for us.

Is there any chance to see major improvements on it in the medium-term future?
I guess the question is naive, but why can't Postgres use multiple
threads for large sort operations?


best regards,

Marc Mamin



Re: [PERFORM] CREATE INDEX as bottleneck

2010-11-11 Thread Marc Mamin
No, CONCURRENTLY is there to improve table availability during index creation,
but it degrades performance.

best regards,

Marc Mamin


-Original Message-
From: Alex Hunsaker [mailto:bada...@gmail.com] 
Sent: Donnerstag, 11. November 2010 19:55
To: Marc Mamin
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] CREATE INDEX as bottleneck

On Thu, Nov 11, 2010 at 06:41, Marc Mamin m.ma...@intershop.de wrote:
 There are a few places in our data flow where we have to wait for index
 creation before being able to distribute the process on multiple threads
 again.

Would CREATE INDEX CONCURRENTLY help here?



Re: [PERFORM] anti-join chosen even when slower than old plan

2010-11-13 Thread Marc Mamin
Hello,

Just a short thought:

Is it imaginable to compare the planner's estimates with the actual
results
and somehow log the worst cases?

a) to help the DBA locate bad statistics and queries
b) as an additional information source for the planner

This could possibly affect the parameters of your formula on the fly.

best regards,

Marc Mamin



Re: [PERFORM] hashed subplan 5000x slower than two sequential operations

2010-12-08 Thread Marc Mamin


Hello,

Are the tables freshly analyzed, with a sufficient default_statistics_target?

You may try to get a better plan by rewriting the query as a UNION to get
rid of the OR clause.
Something like (not tested):

SELECT contexts.context_key
FROM contexts
     JOIN articles ON (articles.context_key = contexts.context_key)
WHERE contexts.parent_key = 392210
  AND articles.indexed

UNION

SELECT context_key
FROM
(
  SELECT contexts.context_key
  FROM contexts JOIN collection_data ON (contexts.context_key = collection_data.context_key)
  WHERE collection_data.collection_context_key = 392210
) foo JOIN articles ON (foo.context_key = articles.context_key)
WHERE articles.indexed
;


I've had one similar problem where there was no way for the planner to notice 
that the query would systematically return very few rows. Here, my last resort 
was to disable some planner methods within the given transaction.

regards,

Marc Mamin

-----Original Message-----
From: pgsql-performance-ow...@postgresql.org on behalf of Shrirang Chitnis
Sent: Wed 12/8/2010 8:05
To: Bryce Nesbitt; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] hashed subplan 5000x slower than two sequential operations
 
Bryce,

The two queries are different:

You are looking for contexts.context_key in first query

WHERE (contexts.parent_key = 392210
  OR contexts.context_key IN
 (SELECT collection_data.context_key
 FROM collection_data
  WHERE collection_data.collection_context_key = 392210)


but second query has context.parent_key

WHERE (contexts.parent_key = 392210 OR contexts.parent_key IN
(392210,392210,395073,1304250))

Is the contexts.context_key an indexed field? contexts.parent_key certainly 
seems to be.


HTH,


Shrirang Chitnis
Sr. Manager, Applications Development
HOV Services


Office: (866) 808-0935 Ext: 39210
shrirang.chit...@hovservices.com
www.hovservices.com



-Original Message-
From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Bryce Nesbitt
Sent: Thursday, December 09, 2010 12:24 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] hashed subplan 5000x slower than two sequential operations

Can you help me understand how to optimize the following.  There's a
subplan which in this case returns 3 rows,
but it is really expensive:



=
explain analyze SELECT contexts.context_key
FROM contexts
 JOIN articles
 ON (articles.context_key=contexts.context_key)
WHERE (contexts.parent_key = 392210
  OR contexts.context_key IN
 (SELECT collection_data.context_key
 FROM collection_data
  WHERE collection_data.collection_context_key = 392210)
)
AND articles.indexed
;

QUERY PLAN

 Hash Join  (cost=83054.41..443755.45 rows=261077 width=4) (actual time=4362.143..6002.808 rows=28 loops=1)
   Hash Cond: (articles.context_key = contexts.context_key)
   ->  Seq Scan on articles  (cost=0.00..345661.91 rows=522136 width=4) (actual time=0.558..3953.002 rows=517356 loops=1)
         Filter: indexed
   ->  Hash  (cost=69921.25..69921.25 rows=800493 width=4) (actual time=829.501..829.501 rows=31 loops=1)
         ->  Seq Scan on contexts  (cost=14.31..69921.25 rows=800493 width=4) (actual time=1.641..829.339 rows=31 loops=1)
               Filter: ((parent_key = 392210) OR (hashed subplan))
               SubPlan
                 ->  Index Scan using collection_data_context_key_index on collection_data  (cost=0.00..14.30 rows=6 width=4) (actual time=0.018..0.023 rows=3 loops=1)
                       Index Cond: (collection_context_key = 392210)
 Total runtime: 6002.976 ms
(11 rows)


=
explain analyze SELECT contexts.context_key
FROM contexts
 JOIN articles
 ON (articles.context_key=contexts.context_key)
WHERE (contexts.parent_key = 392210 OR contexts.parent_key IN
(392210,392210,395073,1304250))
AND articles.indexed
;
   QUERY PLAN

Re: [PERFORM] hashed subplan 5000x slower than two sequential operations

2010-12-08 Thread Marc Mamin

Another point: would a conditionl index help ?

on articles (context_key) where indexed

regards,

-----Original Message-----
From: pgsql-performance-ow...@postgresql.org on behalf of Marc Mamin
Sent: Wed 12/8/2010 9:06
To: Shrirang Chitnis; Bryce Nesbitt; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] hashed subplan 5000x slower than two sequential operations
 



Re: [PERFORM] UNION and bad performance

2010-12-11 Thread Marc Mamin
Hello,

 

UNION will remove all duplicates, which additionally requires the result to
be sorted.

Anyway, for performance issues, you should always start the investigation
with EXPLAIN ANALYZE.
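
If duplicates cannot occur (or do not matter), UNION ALL avoids that extra
sort/unique step; a sketch:

   -- UNION: the combined result is de-duplicated, which needs a sort or hash step
   SELECT a, b, c FROM t1
   UNION
   SELECT a, b, c FROM t2;

   -- UNION ALL: the second result is simply appended, no de-duplication
   SELECT a, b, c FROM t1
   UNION ALL
   SELECT a, b, c FROM t2;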

regards,

 

Marc Mamin

 

From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Olivier Pala
Sent: Thursday, 9 December 2010 11:52
To: pgsql-performance@postgresql.org
Cc: Olivier Pala
Subject: [PERFORM] UNION and bad performance

 

Hi,

I have a performance problem with a UNION query.

First I have this view:

SELECT a,b,c FROM table1, table2 WHERE jointure AND condition

Result: 150 000 records in 1~2 s


Then, I add a UNION into the same view:

SELECT a,b,c FROM table1, table2 WHERE jointure AND condition
UNION
SELECT a,b,c FROM table3

Result: 150 200 records in 6~7 s


Why do I have bad performance for only 200 additional records?

Thanks

DBMS: Postgres 8.3 and 8.4



Re: [PERFORM] Query improvement

2011-05-03 Thread Marc Mamin

 On Mon, May 2, 2011 at 10:54 PM, Mark marek.bal...@seznam.cz wrote:
  but the result has been worse than before. By the way, is there a possibility
  to create a better query with the same effect?
  I have tried more queries, but this one has got the best performance yet.

 Well, this seems to be the worst part:
 
(SELECT page_id FROM mediawiki.page WHERE page_id IN
    (SELECT page_id FROM mediawiki.page
     WHERE (titlevector @@ (to_tsquery('fotbal'))))
  OR page_id IN
    (SELECT p.page_id from mediawiki.page p, mediawiki.revision r,
        (SELECT old_id FROM mediawiki.pagecontent
         WHERE (textvector @@ (to_tsquery('fotbal')))) ss
     WHERE (p.page_id = r.rev_page AND r.rev_id = ss.old_id)))
 


'OR' statements often generate complicated plans. You should try to
rewrite your query with a UNION clause.
Using explicit joins may also help the planner:

SELECT page_id 
FROM mediawiki.page
WHERE (titlevector @@ (to_tsquery('fotbal')))

UNION 

SELECT p.page_id 
FROM mediawiki.page p 
  JOIN mediawiki.revision r on (p.page_id=r.rev_page)
  JOIN mediawiki.pagecontent ss on (r.rev_id=ss.old_id)
WHERE (ss.textvector @@ (to_tsquery('fotbal')))

HTH,

Marc Mamin




Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-13 Thread Marc Mamin
Hi,

 (see attachment)

Under high concurrency you may expect that your data is already in.
In such a case you'd better do nothing at all:

begin

  -- check whether the row already exists, and whether it already holds the new value
  select dat = a_dat from t where id = a_id into test;

  if test is null then

    -- no row yet (or dat is null): try to insert it
    begin
      insert into t (id, dat) values (a_id, a_dat);
    exception
      when unique_violation then
        -- somebody inserted it concurrently: fall back to an update
        update t set dat = a_dat where id = a_id and dat <> a_dat;
        return 0;
    end;

  elsif not test then

    -- row exists with a different value: update it
    update t set dat = a_dat where id = a_id;
    return 0;

  end if;

  return 1;


best regards,

Marc Mamin

-----Original Message-----
From: pgsql-performance-ow...@postgresql.org on behalf of Robert Klemme
Sent: Tue 9/13/2011 6:34
To: Marti Raudsepp
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Postgres for a data warehouse, 5-10 TB
 
On Tue, Sep 13, 2011 at 5:13 PM, Marti Raudsepp ma...@juffo.org wrote:
 On Tue, Sep 13, 2011 at 00:26, Robert Klemme shortcut...@googlemail.com 
 wrote:
 In the case of PG this particular example will work:
 1. TX inserts new PK row
 2. TX tries to insert same PK row = blocks
 1. TX commits
 2. TX fails with PK violation
 2. TX does the update (if the error is caught)

 That goes against the point I was making in my earlier comment. In
 order to implement this error-catching logic, you'll have to allocate
 a new subtransaction (transaction ID) for EVERY ROW you insert.

I don't think so.  You only need to catch the error (see attachment).
Or does this create a sub transaction?

 If
 you're going to be loading billions of rows this way, you will invoke
 the wrath of the vacuum freeze process, which will seq-scan all
 older tables and re-write every row that it hasn't touched yet. You'll
 survive it if your database is a few GB in size, but in the terabyte
 land that's unacceptable. Transaction IDs are a scarce resource there.

Certainly.  But it's not needed as far as I can see.

 In addition, such blocking will limit the parallelism you will get
 from multiple inserters.

Yes, I mentioned the speed issue.  But regardless of the solution for
MySQL's INSERT..ON DUPLICATE KEY UPDATE which Igor mentioned you
will have the locking problem anyhow if you plan to insert
concurrently into the same table and be robust.

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/



Re: [PERFORM] Postgres array parser

2011-12-14 Thread Marc Mamin
Hello,

For such cases (see below), it would be nice to have an unnest function that
only affects the first array dimension.

Something like 

unnest(ARRAY[[1,2],[2,3]], SLICE=1)
=
unnest
--
[1,2]
[2,3]


With this function, I imagine that the following SQL function
might beat the plpgsql FOREACH version.


CREATE OR REPLACE FUNCTION input_value_un (in_inputs numeric[], in_input_nr 
numeric)
   RETURNS numeric AS
$BODY$
   
  SELECT u[1][2]
  FROM unnest($1, SLICE =1) u
  WHERE u[1][1]=in_input_nr
  LIMIT 1;

$BODY$
   LANGUAGE sql IMMUTABLE;

   
   
best regards,

Marc Mamin
   

 -Original Message-
 From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-
 ow...@postgresql.org] On Behalf Of Pavel Stehule
 Sent: Dienstag, 13. Dezember 2011 15:43
 To: Aleksej Trofimov
 Cc: pgsql-performance@postgresql.org
 Subject: Re: [PERFORM] Postgres array parser
 
 Hello
 
 2011/12/13 Aleksej Trofimov aleksej.trofi...@ruptela.lt:
  We have tried foreach syntax, but we have noticed performance
 degradation:
  Function with for: 203ms
  Function with foreach: ~250ms:
 
  there is functions code:
  CREATE OR REPLACE FUNCTION input_value_fe(in_inputs numeric[],
 in_input_nr
  numeric)
   RETURNS numeric AS
  $BODY$
  declare i numeric[];
  BEGIN
         FOREACH i SLICE 1 IN ARRAY in_inputs
             LOOP
                  if i[1] = in_input_nr then
                     return i[2];
                  end if;
             END LOOP;
 
     return null;
  END;
  $BODY$
   LANGUAGE plpgsql VOLATILE
   COST 100;
 
  CREATE OR REPLACE FUNCTION input_value(in_inputs numeric[],
 in_input_nr
  numeric)
   RETURNS numeric AS
  $BODY$
  declare
   size int;
  BEGIN
   size = array_upper(in_inputs, 1);
     IF size IS NOT NULL THEN
 
         FOR i IN 1 .. size LOOP
             if in_inputs[i][1] = in_input_nr then
                 return in_inputs[i][2];
             end if;
         END LOOP;
     END IF;
 
     return null;
  END;
  $BODY$
   LANGUAGE plpgsql VOLATILE
   COST 100;
 
 
  On 12/13/2011 04:02 PM, Pavel Stehule wrote:
 
  Hello
 
  do you know FOREACH IN ARRAY statement in 9.1
 
  this significantly accelerate iteration over array
 
 
  http://www.depesz.com/index.php/2011/03/07/waiting-for-9-1-foreach-
 in-array/
 
 
 
  2011/12/13 Aleksej Trofimovaleksej.trofi...@ruptela.lt:
 
  Hello, I wanted to ask according such a problem which we had faced
 with.
  We are widely using postgres arrays like key-value array by doing
 like
  this:
 
  {{1,5},{2,6},{3,7}}
 
  where 1,2,3 are keys, and 5,6,7 are values. In our pgSql functions
 we are
  using self written array_input(array::numeric[], key::numeric)
 function
  which makes a loop on whole array and searches for key like
  FOR i IN 1 .. size LOOP
             if array[i][1] = key then
                 return array[i][2];
             end if;
  END LOOP;
 
  But this was a good solution until our arrays and database had
 grown. So
  now
  FOR loop takes a lot of time to find value of an array.
 
  And my question is, how this problem of performance could be
 solved? We
  had
  tried pgperl for string parsing, but it takes much more time than
 our
  current solution. Also we are thinking about self-written C++
 function,
  may
  be someone had implemented this algorithm before?
 
  you can use indexes or you can use hstore
 
  Regards
 
  Pavel Stehule
 
  --
  Best regards
 
  Aleksej Trofimov
 
 
  --
  Sent via pgsql-performance mailing list
  (pgsql-performance@postgresql.org)
  To make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-performance
 
 
 
 It is strange - on my comp FOREACH is about 2x faster
 
 postgres=# select input_value(array(select
 generate_series(1,100)::numeric), 10);
  input_value
 -
 
 (1 row)
 
 Time: 495.426 ms
 
 postgres=# select input_value_fe(array(select
 generate_series(1,100)::numeric), 10);
  input_value_fe
 
 
 (1 row)
 
 Time: 248.980 ms
 
 Regards
 
 Pavel
 
 
 
  --
  Best regards
 
  Aleksej Trofimov
 
  UAB Ruptela
 
  Phone: +370 657 80475
 
  E-Mail: aleksej.trofi...@ruptela.lt
  Web:    http://www.ruptela.lt
 
  Ruptela - the most successful IT company in Lithuania 2011
  Ruptela - sekmingiausia Lietuvos aukštųjų technologijų įmonė 2011
 
 http://www.prezidentas.lt/lt/spaudos_centras_392/pranesimai_spaudai/ino
 vatyvus_verslas_-_konkurencingos_lietuvos_pagrindas.html
  http://www.ruptela.lt/news/37/121/Ruptela-sekmingiausia-jauna-
 aukstuju-technologiju-imone-Lietuvoje
 
 
 --
 Sent via pgsql-performance mailing list (pgsql-
 performa...@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-performance

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Postgres array parser

2011-12-14 Thread Marc Mamin
 Yes, it would be great, but I haven't found such a function, which
 splits 2 dimensional array into rows =) Maybe we'll modify existing
 function, but unfortunately we have tried hstore type and function in
 postgres and we see a significant performance improvements. So we only
 need to convert existing data into hstore and I think this is a good
 solution.



I haven't tested hstore yet, but I would be interested to find out if it still
performs better with custom numeric aggregates on the hstore values.

I've made a short proof of concept  test with a custom key/value type to 
achieve such an aggregation.
Something like:


   SELECT x, distinct_sum( (currency,amount)::keyval ) AS overview FROM ... GROUP BY x

   x  currency  amount
   a  EUR         15.0
   a  EUR          5.0
   a  CHF          7.5
   b  USD         12.0
   =

   x  overview
   -  --------
   a  {(EUR,20.0), (CHF,7.5)}
   b  {(USD,12.0)}

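A minimal sketch of such a key/value type and aggregate (illustrative only, not the
exact code of my test, and untested as written):

CREATE TYPE keyval AS (k text, v numeric);

-- final function: merge duplicate keys by summing their values
CREATE OR REPLACE FUNCTION keyval_merge(keyval[])
RETURNS keyval[] AS $$
  SELECT ARRAY(SELECT (e.k, sum(e.v))::keyval
               FROM unnest($1) e
               GROUP BY e.k);
$$ LANGUAGE sql IMMUTABLE;

CREATE AGGREGATE distinct_sum(keyval) (
  SFUNC     = array_append,   -- simply collect the (key,value) pairs
  STYPE     = keyval[],
  FINALFUNC = keyval_merge,   -- one (key,sum) entry per key at the end
  INITCOND  = '{}'
);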

regards,

Marc Mamin

 
 On 12/14/2011 11:21 AM, Marc Mamin wrote:
  Hello,
 
  For such cases (see below), it would be nice to have an unnest
 function that only affect the first array dimension.
 
  Something like
 
  unnest(ARRAY[[1,2],[2,3]], SLICE=1)
  =
  unnest
  --
  [1,2]
  [2,3]
 
 
  With this function, I imagine that following sql function
  might beat the plpgsql FOREACH version.
 
 
  CREATE OR REPLACE FUNCTION input_value_un (in_inputs numeric[],
 in_input_nr numeric)
  RETURNS numeric AS
  $BODY$
 
 SELECT u[1][2]
 FROM unnest($1, SLICE =1) u
 WHERE u[1][1]=in_input_nr
 LIMIT 1;
 
  $BODY$
  LANGUAGE sql IMMUTABLE;
 
 
 
  best regards,
 
  Marc Mamin
 
 
  -Original Message-
  From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-
 performance-
  ow...@postgresql.org] On Behalf Of Pavel Stehule
  Sent: Tuesday, 13 December 2011 15:43
  To: Aleksej Trofimov
  Cc: pgsql-performance@postgresql.org
  Subject: Re: [PERFORM] Postgres array parser
 
  Hello
 
  2011/12/13 Aleksej Trofimovaleksej.trofi...@ruptela.lt:
  We have tried foreach syntax, but we have noticed performance
  degradation:
  Function with for: 203ms
  Function with foreach: ~250ms:
 
  there is functions code:
  CREATE OR REPLACE FUNCTION input_value_fe(in_inputs numeric[],
  in_input_nr
  numeric)
RETURNS numeric AS
  $BODY$
  declare i numeric[];
  BEGIN
  FOREACH i SLICE 1 IN ARRAY in_inputs
  LOOP
   if i[1] = in_input_nr then
  return i[2];
   end if;
  END LOOP;
 
  return null;
  END;
  $BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
 
  CREATE OR REPLACE FUNCTION input_value(in_inputs numeric[],
  in_input_nr
  numeric)
RETURNS numeric AS
  $BODY$
  declare
size int;
  BEGIN
size = array_upper(in_inputs, 1);
  IF size IS NOT NULL THEN
 
  FOR i IN 1 .. size LOOP
  if in_inputs[i][1] = in_input_nr then
  return in_inputs[i][2];
  end if;
  END LOOP;
  END IF;
 
  return null;
  END;
  $BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
 
 
  On 12/13/2011 04:02 PM, Pavel Stehule wrote:
  Hello
 
  do you know FOREACH IN ARRAY statement in 9.1
 
  this significantly accelerate iteration over array
 
 
  http://www.depesz.com/index.php/2011/03/07/waiting-for-9-1-
 foreach-
  in-array/
 
 
  2011/12/13 Aleksej Trofimovaleksej.trofi...@ruptela.lt:
  Hello, I wanted to ask according such a problem which we had
 faced
  with.
  We are widely using postgres arrays like key-value array by
 doing
  like
  this:
 
  {{1,5},{2,6},{3,7}}
 
  where 1,2,3 are keys, and 5,6,7 are values. In our pgSql
 functions
  we are
  using self written array_input(array::numeric[], key::numeric)
  function
  which makes a loop on whole array and searches for key like
  FOR i IN 1 .. size LOOP
  if array[i][1] = key then
  return array[i][2];
  end if;
  END LOOP;
 
  But this was a good solution until our arrays and database had
  grown. So
  now
  FOR loop takes a lot of time to find value of an array.
 
  And my question is, how this problem of performance could be
  solved? We
  had
  tried pgperl for string parsing, but it takes much more time than
  our
  current solution. Also we are thinking about self-written C++
  function,
  may
  be someone had implemented this algorithm before?
 
  you can use indexes or you can use hstore
 
  Regards
 
  Pavel Stehule
 
  --
  Best regards
 
  Aleksej Trofimov
 
 
  --
  Sent via pgsql-performance mailing list
  (pgsql-performance@postgresql.org)
  To make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-performance
 
  It is strange - on my comp FOREACH is about 2x faster
 
  postgres=# select input_value(array(select
  generate_series(1,100)::numeric), 10);
input_value
  -
 
  (1 row)
 
  Time: 495.426 ms
 
  postgres=# select input_value_fe(array(select
  generate_series(1,100

Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Marc Mamin
Yes, but it should become a bit slower if you fix your code :-)

  where t_imp.id is null and test.id=t_imp.id;
  =
  where t_imp.id is not null and test.id=t_imp.id;

and a partial index on matching rows might help (should be tested):

 (after the first update)
 create index t_imp_ix on t_imp(t_value,t_record,output_id) where t_imp.id is 
not null.

regards,
Marc Mamin

-Original Message-
From: pgsql-performance-ow...@postgresql.org on behalf of Jochen Erwied
Sent: Sat 1/7/2012 12:57
To: anto...@inaps.org
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Duplicate deletion optimizations
 
Friday, January 6, 2012, 4:21:06 PM you wrote:

 Every 5 minutes, a process have to insert a few thousand of rows in this
 table, but sometime, the process have to insert an already existing row
 (based on values in the triplet (t_value, t_record, output_id). In this
 case, the row must be updated with the new count value. I've tried some
 solution given on this stackoverflow question [1] but the insertion rate
 is always too low for my needs.

I did check the following in a loop, starting with an empty table, and
inserting/updating 5 random unique entries. After 15 minutes I've got
about 10 million records, each loop takes about 3 seconds. After 30 minutes
the table contains approx. 18 million entries, time per loop only slightly
increased. After 90 minutes the database has about 30 million entries. The
speed has dropped to about 15-20 seconds per loop, but the server is doing
lots of other queries in parallel, so with an unloaded server the updates
should still take less than 10 seconds.

The generator runs in perl, and generates records for a maximum of 100 
million different entries:

use strict;

srand time;
my $i = 0;
open FD, '>data.in';
for (1..5)
{
$i += rand(2000);
print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255, $i%255, rand(1000));
}
close FD;

The SQL-script looks like this:

\timing on
begin;
create temp table t_imp(id bigint,t_value integer,t_record integer,output_id 
integer,count bigint);
\copy t_imp (t_value, t_record, output_id, count) from 'data.in'
--an index is not really needed, table is in memory anyway
--create index t_imp_ix on t_imp(t_value,t_record,output_id);

-- find matching rows
update t_imp
   set id=test.id
   from test
   where 
(t_imp.t_value,t_imp.t_record,t_imp.output_id)=(test.t_value,test.t_record,test.output_id);
-- update matching rows using primary key
update test
   set count=t_imp.count
   from t_imp
   where t_imp.id is null and test.id=t_imp.id;
-- insert missing rows
insert into test(t_value,t_record,output_id,count)
   select t_value,t_record,output_id,count
  from t_imp
  where id is null;
commit;

Advantages of this solution:

- all updates are done in-place, no index modifications (except for the 
  inserts, of course)
- big table only gets inserts
- no dead tuples from deletes
- completely avoids sequential scans on the big table

Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6 
GHz, table and indices stored on a SSD)

Table statistics:

relid | 14332525
schemaname| public
relname   | test
seq_scan  | 8
seq_tup_read  | 111541821
idx_scan  | 149240169
idx_tup_fetch | 117901695
n_tup_ins | 30280175
n_tup_upd | 0
n_tup_del | 0
n_tup_hot_upd | 0
n_live_tup| 30264431
n_dead_tup| 0
last_vacuum   |
last_autovacuum   |
last_analyze  |
last_autoanalyze  | 2012-01-07 12:38:49.593651+01
vacuum_count  | 0
autovacuum_count  | 0
analyze_count | 0
autoanalyze_count | 31

The sequential scans were from some 'select count(*)' in between.

HTH.

-- 
Jochen Erwied |   home: joc...@erwied.eu +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: j...@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erw...@vodafone.de   +49-173-5404164


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



[PERFORM] How to remove a table statistics ?

2012-01-31 Thread Marc Mamin
Hello, 

I have a weird table, upon which the queries are much faster when no
statistics were collected.

Is there a way to delete statistics information for a table ?
I've tried ALTER.. set STATISTICS 0 and then run ANALYZE, but it seems
that old statistics are kept this way.
Can I delete entries directly in pg_statistic ?
(Postgresql 9.1)


 short backgroud Info:
 
 One of the table indexes is a GIN index on a tsvector-returning function, which
is very costly.
 Once analyzed, the query planner often ignores this index in favour of
another one, hence triggering this function too often.
 
 I'll fix that model, but am first looking for a quick way to restore
performance on our production servers.
 
 
 best regards,
 
 Marc Mamin


Re: [PERFORM] How to remove a table statistics ?

2012-01-31 Thread Marc Mamin
Hello,
Some more tests have shown that removing the statistics just moves the
performance issue to other places.
The main issue here is a bad design, so I'd better focus on this than lose
too much time on the current situation.
But this raises an interesting question on how/where Postgres stores
statistics on functional indexes.
In pg_statistic there is information on the column content, but I couldn't
find stats on the function result, which is fully computed only during the index
creation.
I guess that the planner would need to know at least the function cost to
weigh the benefit of such an index.
In my case I would set the function cost to 200 ...


I have also tried to reduce random_page_cost to 2, and it seems to help in a 
few cases.


(anonymized)

explain analyze
SELECT min(msoffset) as t, coalesce(pipelinecall_id,-2) as pid
 from  aserrorlist_20120125 l
 WHERE 1 = 1
 AND msoffset = 132750300
AND my_func('foo',20120125,l.id, l.header_9_10_id, l.categories_id, 
l.firstline_id) @@  to_aserrcfg_search_tsq($KUKU$lexeme_1 ! lexeme_2$KUKU$)
 group by 
ridcount,pipelinecall_id,coalesce(toplevelrid,msoffset::varchar);


without stats: http://explain.depesz.com/s/qPg
with stats: http://explain.depesz.com/s/88q

aserr_20120125_tvi: GIN Index on my_func(.,.,.,.,.,.)

best regards,

Marc Mamin

 -Original Message-
 From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-
 ow...@postgresql.org] On Behalf Of Josh Berkus
 Sent: Tuesday, 31 January 2012 19:44
 To: pgsql-performance@postgresql.org
 Subject: Re: [PERFORM] How to remove a table statistics ?
 
 On 1/31/12 3:50 AM, Marc Mamin wrote:
  Hello,
 
  I have a weird table, upon with the queries are much faster when no
  statics were collected.
 
  Is there a way to delete statistics information for a table ?
  I've tried ALTER.. set STATISTICS 0 and then run ANALYZE, but it
 seems
  that old statistics are kept this way.
  Can I delete entries directly in pg_statistic ?
  (Postgresql 9.1)
 
 You can, but it won't do any good; autovaccum will replace them.
 
 It would be better to fix the actual query plan issue.  If you can,
 post
 the query plans with and without statistics (EXPLAIN ANALYZE, please)
 here.
 
 --
 Josh Berkus
 PostgreSQL Experts Inc.
 http://pgexperts.com
 
 --
 Sent via pgsql-performance mailing list (pgsql-
 performa...@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-performance

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] text search: tablescan cost for a tsvector

2012-02-06 Thread Marc Mamin
Hello,

I have quite systematically better performance with the text search when
I disable the statistics collection for the tsvector column.
So I wonder if such statistics ever make sense.

Here a testcase:

The table contains 200'000 tsvectors, and the lexeme 'fooblablabla'
exists in all of them:
Without statistics, the planner decides as expected for the GIN index.
After analyze, it switches to a table scan, which is also expected, but the
query is 3 times slower.

My first thought was that the Bitmap Heap Scan was really fast as the
searched term is always at the first position.
So I repeated the test with an additional search term at the last
position, but without significant change:

(result from the 6. test below)

without analyze: http://explain.depesz.com/s/6At
with analyze:http://explain.depesz.com/s/r3B


best regards,

Marc Mamin




Here all my results, always one of the fastest from a few runs.


CREATE TABLE tsv_test
(
  id bigserial NOT NULL,
  v tsvector
);


The code to fill the table with test data can be found below


The test query:

explain analyze
select id from tsv_test where v @@ 'lexeme3179'::tsquery 
UNION ALL
select id from tsv_test where v @@ 'lexeme5'::tsquery
UNION ALL
select id from tsv_test where v @@ 'fooblablabla'::tsquery

The results

A) on first lexeme

1) without indexes without analyze:
   http://explain.depesz.com/s/bOv

2) alter table tsv_test add constraint tsv_test_pk primary key(id);
   http://explain.depesz.com/s/9QQ (same as previous);

3) create index tsv_gin on tsv_test using gin(v);
   http://explain.depesz.com/s/r4M = fastest

4) ANALYZE tsv_test (id);
   http://explain.depesz.com/s/MyC (same as previous);

5) ANALYZE tsv_test;
   http://explain.depesz.com/s/qu3S 
   

B) on last lexeme

6) create table  tsv_test2 as select id,
   v||'zzthisisalongerlexemethisisalongerlexeme'::tsvector 
   from tsv_test;
   
   explain analyze
   select id from tsv_test2 where v @@
'zzthisisalongerlexemethisisalongerlexeme'::tsquery 
   
   http://explain.depesz.com/s/6At 
   
   ANALYZE tsv_test2;
   
   http://explain.depesz.com/s/r3B 



test data:

insert into tsv_test (v) 
select
cast('fooblablabla' ||
' lexeme'||s%2|| ' lexeme'||s%3|| ' lexeme'||s%4||
' lexeme'||s%4|| ' lexeme'||s%5|| ' lexeme'||s%6||
' lexeme'||s%7|| ' lexeme'||s%8|| ' lexeme'||s%9||
' lexeme'||s%10 || ' lexeme2'||s%11 || ' lexeme3'||s%12 ||
' lexeme'||s%11 || ' lexeme2'||s%12 || ' lexeme3'||s%22 ||
' lexeme'||s%12 || ' lexeme2'||s%13 || ' lexeme3'||s%32 ||
' lexeme'||s%13 || ' lexeme2'||s%14 || ' lexeme3'||s%42 ||
' lexeme'||s%14 || ' lexeme2'||s%15 || ' lexeme3'||s%52 ||
' lexeme'||s%15 || ' lexeme2'||s%16 || ' lexeme3'||s%62 ||
' lexeme'||s%16 || ' lexeme2'||s%17 || ' lexeme3'||s%72 ||
' lexeme'||s%17 || ' lexeme2'||s%18 || ' lexeme3'||s%82 ||
' lexeme'||s%18 || ' lexeme2'||s%19 || ' lexeme3'||s%92 ||
' lexeme'||s%19 || ' lexeme2'||s%10 || ' lexeme3'||s%15 ||
' lexeme'||s%12 || ' lexeme2'||s%71 || ' lexeme3'||s%16 ||
' lexeme'||s%20 || ' lexeme2'||s%81 || ' lexeme3'||s%17 ||
' lexeme'||s%35 || ' lexeme2'||s%91 || ' lexeme3'||s%18 ||
' lexeme'||s%100 || ' lexeme2'||s%110 || ' lexeme3'||s%120 ||
' lexeme'||s%110 || ' lexeme2'||s%120 || ' lexeme3'||s%220 ||
' lexeme'||s%120 || ' lexeme2'||s%130 || ' lexeme3'||s%320 ||
' lexeme'||s%130 || ' lexeme2'||s%140 || ' lexeme3'||s%420 ||
' lexeme'||s%140 || ' lexeme2'||s%150 || ' lexeme3'||s%520 ||
' lexeme'||s%150 || ' lexeme2'||s%160 || ' lexeme3'||s%620 ||
' lexeme'||s%160 || ' lexeme2'||s%170 || ' lexeme3'||s%720 ||
' lexeme'||s%170 || ' lexeme2'||s%180 || ' lexeme3'||s%820 ||
' lexeme'||s%180 || ' lexeme2'||s%190 || ' lexeme3'||s%920 ||
' lexeme'||s%190 || ' lexeme2'||s%100 || ' lexeme3'||s%150 ||
' lexeme'||s%120 || ' lexeme2'||s%710 || ' lexeme3'||s%160 ||
' lexeme'||s%200 || ' lexeme2'||s%810 || ' lexeme3'||s%170 ||
' lexeme'||s%350 || ' lexeme2'||s%910 || ' lexeme3'||s%180 
as tsvector)
FROM generate_series(1,10) s
UNION ALL
select
cast('fooblablabla' ||
' thisisalongerlexemethisisalongerlexeme'||s%2|| '
thisisalongerlexemethisisalongerlexeme'||s%3|| '
thisisalongerlexemethisisalongerlexeme'||s%4||
' thisisalongerlexemethisisalongerlexeme'||s%4|| '
thisisalongerlexemethisisalongerlexeme'||s%5|| '
thisisalongerlexemethisisalongerlexeme'||s%6||
' thisisalongerlexemethisisalongerlexeme'||s%7|| '
thisisalongerlexemethisisalongerlexeme'||s%8|| '
thisisalongerlexemethisisalongerlexeme'||s%9||
' thisisalongerlexemethisisalongerlexeme'||s%10 || '
thisisalongerlexemethisisalongerlexeme2'||s%11 || '
thisisalongerlexemethisisalongerlexeme3'||s%12 ||
' thisisalongerlexemethisisalongerlexeme'||s%11 || '
thisisalongerlexemethisisalongerlexeme2'||s%12 || '
thisisalongerlexemethisisalongerlexeme3'||s%22 ||
' thisisalongerlexemethisisalongerlexeme'||s%12 || '
thisisalongerlexemethisisalongerlexeme2'||s%13 || '
thisisalongerlexemethisisalongerlexeme3'||s%32 ||
' thisisalongerlexemethisisalongerlexeme'||s%13 || '
thisisalongerlexemethisisalongerlexeme2'||s

Re: [PERFORM] text search: tablescan cost for a tsvector

2012-02-29 Thread Marc Mamin
 From: Robert Haas [mailto:robertmh...@gmail.com]
 Sent: Wed 2/29/2012 7:32

  
 On Mon, Feb 6, 2012 at 6:05 AM, Marc Mamin m.ma...@intershop.de wrote:
  without analyze: http://explain.depesz.com/s/6At
  with analyze:http://explain.depesz.com/s/r3B
... 
 The problem seems to be that the cost estimator doesn't know that
 detoasting is expensive.

Hello,

Tom Lane has started a follow up thread in the hacker list.
Detoasting is indeed the main obstacle, but I've repeated my test using plain 
storage
and the planner still chooses (systematically?) the slowest query.
It seems that I bumped into 2 different issues at the same time.
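For reference, plain storage can be forced on the tsvector column of the test table
from my previous mail with something like:

ALTER TABLE tsv_test ALTER COLUMN v SET STORAGE PLAIN;

(SET STORAGE only affects newly written rows, so it has to be done before loading the data.)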

http://archives.postgresql.org/pgsql-hackers/2012-02/msg00896.php

Background:
Our reporting system offers among others time histograms
combined with FTS filtering on error occurrences (imported from error logs).
It is hence not unusual that given search terms are found within a majority of
the documents...

best regards,

Marc Mamin


Re: [PERFORM] Performance of a large array access by position (tested version 9.1.3)

2012-06-26 Thread Marc Mamin

 On 22/06/12 09:02, Maxim Boguk wrote: 

 May be I completely wrong but I always assumed that the access speed to the 
 array element in PostgreSQL should be close to constant time.
 But in tests I found that access speed degrade as O(N) of array size.

 Is that behaviour is correct?


 From: pgsql-performance-ow...@postgresql.org On Behalf Of Jesper Krogh

 Default column storage is to compress it, and store in TOAST with large 
 values. 
 This it what is causing the shift. Try to change the column storage of the 
 column
 to EXTERNAL instead and rerun the test. 


Hello,

I've repeated your test in a simplified form:
you are right :-(

create table t1 ( _array int[]);
alter table t1 alter _array set storage external;
insert into t1 SELECT ARRAY(SELECT * FROM generate_series(1,5));

create table t2 ( _array int[]);
alter table t2 alter _array set storage external;
insert into t2 SELECT ARRAY(SELECT * FROM generate_series(1,500));

explain analyze SELECT _array[1] FROM t1;
Total runtime: 0.125 ms

explain analyze SELECT _array[1] FROM t2;
Total runtime: 8.649 ms


best regards,

Marc Mamin



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance of a large array access by position (tested version 9.1.3)

2012-06-26 Thread Marc Mamin


 -Original Message-
 From: Pavel Stehule [mailto:pavel.steh...@gmail.com]
 
 2012/6/26 Marc Mamin m.ma...@intershop.de:
 
  On 22/06/12 09:02, Maxim Boguk wrote:
 
  May be I completely wrong but I always assumed that the access
 speed to the array element in PostgreSQL should be close to constant
 time.
  But in tests I found that access speed degrade as O(N) of array
 size.
 
  Is that behaviour is correct?
 
 yes - access to n position means in postgresql - skip n-1 elements


Hmmm...

how many elements have to be skipped here?

SELECT _array[1] FROM t2;

I wonder if the time is rather spent in first retrieving the array itself
before accessing its elements.

regards,

Marc Mamin

 
 Regards
 
 Pavel
 
 
 
  From: pgsql-performance-ow...@postgresql.org On Behalf Of Jesper
 Krogh
 
  Default column storage is to compress it, and store in TOAST with
 large values.
  This it what is causing the shift. Try to change the column storage
 of the column
  to EXTERNAL instead and rerun the test.
 
 
  Hello,
 
  I've repeated your test in a simplified form:
  you are right :-(
 
  create table t1 ( _array int[]);
  alter table t1 alter _array set storage external;
  insert into t1 SELECT ARRAY(SELECT * FROM generate_series(1,5));
 
  create table t2 ( _array int[]);
  alter table t2 alter _array set storage external;
  insert into t2 SELECT ARRAY(SELECT * FROM
 generate_series(1,500));
 
  explain analyze SELECT _array[1] FROM t1;
  Total runtime: 0.125 ms
 
  explain analyze SELECT _array[1] FROM t2;
  Total runtime: 8.649 ms
 
 
  best regards,
 
  Marc Mamin
 
 
 
  --
  Sent via pgsql-performance mailing list (pgsql-
 performa...@postgresql.org)
  To make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-performance

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] PostgreSQL db, 30 tables with number of rows 100 (not huge) - the fastest way to clean each non-empty table and reset unique identifier column of empty ones.

2012-07-06 Thread Marc Mamin



Stanislaw Pankevich wrote:
  PostgreSQL db, 30 tables with number of rows  100 (not huge) - the 
 fastest way to clean each
 non-empty table and reset unique identifier column of empty ones 

Hello, 

2 'exotic' ideas:

- use dblink_send_query to do the job in multiple threads (I doubt this really 
could be faster)
- have prepared empty tables in a separate schema, and a garbage schema:

   ALTER TABLE x set schema garbage;
   ALTER TABLE prepared.x set schema current;

you should be ready for the next test, 

but you still have to clean the garbage and move tables back to prepared for the next
test but one, in the background.

best regards,

Marc Mamin






 I wonder, what is the fastest way to accomplish this kind of task in 
 PostgreSQL. I am interested in
 the fastest solutions ever possible.

 I need the fastest cleaning strategy for such case working on PostgreSQL 
 both 8 and 9.

 I see the following approaches:

 1) Truncate each table. It is too slow, I think, especially for empty 
 tables.

 Did you actually try it? That's the king's way to performance questions!
 Truncating a single table is done in a matter of microseconds, particularly
 if it is not big.
 Do you have tens of thousands of tables?

 Actually, 10-100 tables.

 You could of course run a SELECT 1 FROM table LIMIT 1, but again I don't
 think that this will be considerably faster than just truncating the table.
 
 Exactly this query is much faster, believe me. You can see my latest
 results on https://github.com/stanislaw/truncate-vs-count.

Ok, I believe you.

My quick tests showed that a sible truncate (including transaction and
client-server roundtrip via UNIX sockets takes some 10 to 30 milliseconds.

Multiply that with 100, and you end up with just a few seconds at most.
Or what did you measure?

I guess you run that deletion very often so that it is painful.

Still I think that the biggest performance gain is to be had by using
PostgreSQL's features (truncate several tables in one statement, ...).

Try to bend your Ruby framework!

Yours,
Laurenz Albe

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance




Re: [PERFORM] A very long running query....

2012-07-21 Thread Marc Mamin


Hello,
isn't the first test superfluous here ?

   where extract('day' from message_copies.msg_date_rec) = 17
   and date_trunc('day',message_copies.msg_date_rec) = '2012-07-17'


 Here is the index:
 
 CREATE INDEX idx_message_copies_wk2_date_src_pos_partial
   ON feed_all_y2012m07.message_copies_wk2
   USING btree
   (date_trunc('day'::text, msg_date_rec),
   src_id,
   (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || 
 pos_georef4::text))
 TABLESPACE archive
   WHERE (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || 
 pos_georef4::text) IS NOT NULL 
   OR NOT (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || 
 pos_georef4::text) = ''::text;


the georef test can be simplified using coalesce:

  and (message_copies.pos_georef1 || message_copies.pos_georef2 || 
 message_copies.pos_georef3 || message_copies.pos_georef4) <> ''
  and not (message_copies.pos_georef1 || message_copies.pos_georef2 || 
 message_copies.pos_georef3 || message_copies.pos_georef4) is null
  =
  and coalesce (
(message_copies.pos_georef1 || message_copies.pos_georef2 || 
message_copies.pos_georef3 || message_copies.pos_georef4), 
 '') <> ''
  
In order to avoid this test at query time you might add a boolean column
message_copies.has_georef,
and keep it up to date with a before insert or update trigger (see the sketch below).
This will allow you to shorten your index definition and simplify the planner's task a little bit.
Moreover it will speed up your query in cases where the index doesn't get used.

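A rough sketch of that approach (untested; table and column names taken from your
index definition above):

ALTER TABLE feed_all_y2012m07.message_copies_wk2 ADD COLUMN has_georef boolean;

CREATE OR REPLACE FUNCTION set_has_georef() RETURNS trigger AS $$
BEGIN
  NEW.has_georef := coalesce(NEW.pos_georef1 || NEW.pos_georef2
                          || NEW.pos_georef3 || NEW.pos_georef4, '') <> '';
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER message_copies_wk2_has_georef
  BEFORE INSERT OR UPDATE ON feed_all_y2012m07.message_copies_wk2
  FOR EACH ROW EXECUTE PROCEDURE set_has_georef();
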
As Tom already mentioned, it may make sense not to concatenate the georefs
within the index, but to keep them separated, or even in different
indexes.
Which is best depends on the other queries running against this table.
  
HTH,

Marc Mamin
  


-Original Message-
From: pgsql-performance-ow...@postgresql.org on behalf of Ioannis 
Anagnostopoulos
Sent: Sat 7/21/2012 1:56 AM
To: Tom Lane
Cc: Claudio Freire; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] A very long running query
 
On 21/07/2012 00:10, Tom Lane wrote:
 Claudio Freire klaussfre...@gmail.com writes:
 Looking at this:
 -  Index Scan using
 idx_message_copies_wk2_date_src_pos_partial on message_copies_wk2
 message_copies  (cost=0.00..19057.93 rows=52 width=32) (actual
 time=62.124..5486270.845 rows=387524 loops=1)
   Index Cond: ((date_trunc('day'::text,
 msg_date_rec) = '2012-07-17 00:00:00'::timestamp without time zone)
 AND (src_id = 1))
   Filter: ((date_part('day'::text,
 msg_date_rec) = 17::double precision) AND (NOT (pos_georef1)::text
 || (pos_georef2)::text) || (pos_georef3)::text) ||
 (pos_georef4)::text) IS NULL)) AND (pos_georef1)::text ||
 (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text)
  ''::text))
 I think the real problem is that the planner has no hope of doing
 anything very accurate with such an unwieldy filter condition.  I'd look
 at ways of making the filter conditions simpler, perhaps by recasting
 the data representation.  In particular, that's a horridly bad way of
 asking whether some columns are empty, which I gather is the intent.
 If you really want to do it just like that, creating an index on the
 concatenation expression would guide ANALYZE to collect some stats about
 it, but it would probably be a lot more efficient to put together an AND
 or OR of tests on the individual columns.

   regards, tom lane
So what you suggest is to forget all together the concatenation of the 
georef1/2/3/4 and instead alter my query with something like:

georef1 is not null and not georeg1 = ''etc for georef2 3 and 4

That would require to alter my index and have the four georef columns 
separately in it and not as a concatenation and so on for the partial 
index part. And a final thing, you seem to imply that the indexes are 
used by the analyser to collect statistics even if they are not used. So 
an index serves not only as a way to speed up targeted queries but also 
to provide better statistics to the analyzer?

Kind Regards
Yiannis

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance




Re: [PERFORM] Are bitmap index scans slow to start?

2013-02-28 Thread Marc Mamin

Rebuilding the index might help, as it would put all the leaf pages holding 
values for session_id=27 adjacent to each other, so they would read from disk 
faster.  But with a name like session_id, I don't know how long such 
clustering would last though.

If I'm right about the index disk-read time, then switching to a plain index 
scan rather than a bitmap index scan would make no difference--either way the 
data has to come off the disk.



I'd prefer a
strategy that allowed fast performance the first time, rather than slow the
first time and extremely fast subsequently.

Hello,

if the index is only used to locate rows for a single session_id, you may
consider splitting it into a set of partial indexes.

e.g.
create index i_0 on foo where session_id%4 =0;
create index i_1 on foo where session_id%4 =1;
create index i_2 on foo where session_id%4 =2;
create index i_3 on foo where session_id%4 =3;

(can be built in parallel using separate threads)

Then you will have to ensure that all your WHERE clauses also contain the index 
condition:

WHERE session_id = 27 AND session_id%4 =27%4

regards,

Marc Mamin


Re: [PERFORM] Trying to eliminate union and sort

2013-07-15 Thread Marc Mamin
Hello,

you may want to try starting with some CTEs that first retrieve the required subsets.
Adding ordering within those CTEs might also improve the timing of the following
sort/join operations.
(sorry for the top posting)

Something like:

WITH T1 AS (
  SELECT id, typeid, backup_id, mycolumn1, mycolumn2
  FROM table1 t WHERE t.external_id IN ('6544', '2234', '2', '4536')
  ORDER BY mycolumn2, id  -- ?
 ),
 
TYPES AS (SELECT DISTINCT typeid FROM T1),

T3_OTHERS AS ( SELECT id, otherid FROM table3 JOIN TYPES ON table3.id = 
TYPES.typeid
   -- Order BY id ?  
   )

SELECT
T1.id,
T1.mycolumn1,
T3_OTHERS.otherid,
T3_2.otherid,
T1.mycolumn2 AS mycolumn2

FROM T1 
LEFT OUTER JOIN T3_OTHERS ON T1.typeid = T3_OTHERS.id
LEFT OUTER JOIN table2 t2 ON (t2.real_id = T1.backup_id OR t2.real_id = T1.id)
LEFT OUTER JOIN table3 T3_2  ON t2.third_id = T3_2.id
ORDER BY T1.mycolumn2,T1.id

regards,
Marc Mamin


From: pgsql-performance-ow...@postgresql.org
[pgsql-performance-ow...@postgresql.org] on behalf of Brian Fehrle [bri...@consistentstate.com]
Sent: Monday, 15 July 2013 18:12
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Trying to eliminate union and sort

On 07/12/2013 04:43 PM, Josh Berkus wrote:
 As for the counts on the tables:
 table1  3,653,472
 table2  2,191,314
 table325,676,589

 I think it's safe to assume right now that any resulting joins are not
 one-to-one
 Hmmm?  How is doing a subselect in the SELECT clause even working, then?

Oh my, this is sad. the query in all returns 9,955,729 rows, so the sub
queries are run on each of these resulting rows, however in this entire
result set, subquery 1 returns 16 distinct rows, subquery 2 returns 63
different rows, but those sub queries are run over 10 million times to
return these few distinct rows. So it's running many times, but
returning the same small set of data over and over again.

- Brian F




--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] FTS performance issue - planner problem identified (but only partially resolved)

2013-07-19 Thread Marc Mamin

 SELECT * FROM FullTextSearch WHERE content_tsv_gin @@
 plainto_tsquery('english', 'good');

 It's slow ( 30 sec.) for some GB (27886 html files, originally 73 MB zipped).
 The planner obviously always chooses table scan


Hello,

A probable reason for the time difference is the cost for decompressing toasted 
content.
At least in 8.3, the planner was not good at estimating it.

I'm getting better overall performance since I stopped collecting statistics on
tsvectors.
An alternative would have been to disallow compression on them.
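For example (sketch; table and column names taken from your query, to be tested first):

ALTER TABLE FullTextSearch ALTER COLUMN content_tsv_gin SET STATISTICS 0;     -- stop collecting stats
ALTER TABLE FullTextSearch ALTER COLUMN content_tsv_gin SET STORAGE EXTERNAL; -- no compression

(EXTERNAL still allows out-of-line storage but disables compression, and only affects
newly written rows.)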

I'm aware this is a drastic way and would not recommend it without testing. The 
benefit may depend on the type of data you are indexing.
In our use case these are error logs with many Java stack traces, hence with
many poorly discriminative lexemes.

see: http://www.postgresql.org/message-id/27953.1329434...@sss.pgh.pa.us
as a comment on
http://www.postgresql.org/message-id/c4dac901169b624f933534a26ed7df310861b...@jenmail01.ad.intershop.net

regards,

Marc Mamin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] FTS performance issue - planner problem identified (but only partially resolved)

2013-07-20 Thread Marc Mamin



From: Stefan Keller [sfkel...@gmail.com]
Sent: Saturday, 20 July 2013 01:55

Hi Marc

Thanks a lot for your hint!

You mean doing a SET track_counts (true); for the whole session?

No, 
I mean 

ALTER TABLE table ALTER  ts_vector_column SET STATISTICS 0;

And remove existing statistics

DELETE FROM pg_catalog.pg_statistic 
where starelid='table':: regclass
AND staattnum = (SELECT attnum FROM pg_attribute
WHERE attrelid = 'table':: regclass
AND  attname  =  'ts_vector_column'::name
)

But you should first try to find out which proportion of your ts queries are 
faster 
when using a table scan as they will probably not happen anymore afterwards !
(Except if further columns on your table 'FullTextSearch' are considered by the 
planner)




That would be ok if it would be possible just for the gin index.

It's obviously an issue of the planner estimation costs.
The data I'm speaking about (movies) has a text attribute which has
a length of more than 8K so it's obviously having to do with
detoasting.
But the thoughts about @@ operators together with this GIN index seem
also to be valid.

I hope this issue is being tracked in preparation for 9.3.

Regards, Stefan


2013/7/19 Marc Mamin m.ma...@intershop.de:

 SELECT * FROM FullTextSearch WHERE content_tsv_gin @@
 plainto_tsquery('english', 'good');

 It's slow ( 30 sec.) for some GB (27886 html files, originally 73 MB 
 zipped).
 The planner obviously always chooses table scan


 Hello,

 A probable reason for the time difference is the cost for decompressing 
 toasted content.
 At least in 8.3, the planner was not good at estimating it.

 I'm getting better overall performances since I've stopped collect statistic 
 on tsvectors.
 An alternative would have been to disallow compression on them.

 I'm aware this is a drastic way and would not recommend it without testing. 
 The benefit may depend on the type of data you are indexing.
 In our use case these are error logs with many java stack traces, hence with 
 many lexemes poorly discriminative.

 see: http://www.postgresql.org/message-id/27953.1329434...@sss.pgh.pa.us
 as a comment on
 http://www.postgresql.org/message-id/c4dac901169b624f933534a26ed7df310861b...@jenmail01.ad.intershop.net

 regards,

 Marc Mamin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] stable and immutable functions in GROUP BY clauses.

2013-08-26 Thread Marc Mamin
Hello,

Stable and immutable functions do not improve performance when used within the 
GROUP BY clause.
Here, the function will be called for each row.

To avoid it, I can replace the function by its arguments within GROUP BY.

Maybe this hint is worth a note within the documentation on Function Volatility.

I have the situation where queries are generated by the application and it
would be a pain to extend the query builder
in order to avoid this performance issue.
So I wonder if it would be possible for the query planner to recognize such 
cases and optimize the query internally ?

best regards,
Marc Mamin


here an example to highlight possible performance loss:

create temp table ref ( i int, r int);
create temp table val ( i int, v int);

insert into ref select s,s%2 from generate_series(1,1)s;
insert into val select s,s%2 from generate_series(1,1)s;

create or replace function getv(int) returns int as 
$$ select v+1 from val where i=$1; $$ language SQL stable;

explain analyze select  getv(r) from ref group by r;
Total runtime: 5.928 ms

explain analyze select  getv(r) from ref group by getv(r);
Total runtime: 3980.012 ms

-- and more reasonably with an index:

create unique index val_ux on val(i);

explain analyze select  getv(r) from ref group by r;
Total runtime: 4.278 ms

explain analyze select  getv(r) from ref group by getv(r);
Total runtime: 68.758 ms


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] stable and immutable functions in GROUP BY clauses.

2013-09-16 Thread Marc Mamin

 
 Hello,
 
 Stable and immutable functions do not improve performance when used within 
 the GROUP BY clause.
 Here, the function will be called for each row.
 
 To avoid it, I can replace the funtion by its arguments within GROUP BY.

Shame on me !
This is of course bullsh...  It has nothing to do with immutability and
only applies to a few cases

e.g: it's fine for select x+1 ... group by  x,
but not for select x^2 ... group by  x

Marc Mamin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] CREATE TABLE AS WITH FREEZE ?

2013-11-13 Thread Marc Mamin
Hello,

Does anything speak against adding a WITH FREEZE option to CREATE TABLE AS,
similar to the new COPY FREEZE feature ?
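
For comparison, the workaround available today goes through COPY and an intermediate
dump, roughly like this (illustrative names, 9.3 syntax):

BEGIN;
CREATE TABLE foo_new (LIKE foo);
COPY foo_new FROM '/tmp/foo.dump' (FREEZE);
COMMIT;

A WITH FREEZE on CREATE TABLE AS would avoid the intermediate dump for SELECT-based
bulk loads.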

best regards,

Marc Mamin





[PERFORM] fragmention issue with ext4: e4defrag?

2014-07-02 Thread Marc Mamin
Hello,

Does anyone have some experience using defragmentation tools on Linux against
tablespaces?

we are facing fragmentation problems with postgres instances having a few TB of 
data.
( RAID 5 )

I/O throughput decreased from 300 MB/s to 160 MB/s.


- We first moved some schemas to separate servers.
  After that we still have  150'000 tables in 1.5 TB
  
- Now we are in the process of vacuuming FULL historical tables which are not 
written anymore. 
  This seems to improve the I/O considerably
  
Our remaining issue is that the free space fragmentation is still suboptimal,
so that fragmentation will probably start again soon.

Would it make sense to use a tool like e4defrag 
(http://www.linux.org/threads/online-defragmentation.4121/)
in order to defrag the free space ?
And how safe is it to use such a tool against a running postgres instance?



many thanks,

Marc Mamin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Building multiple indexes on one table.

2014-07-23 Thread Marc Mamin
From: pgsql-performance-ow...@postgresql.org
[pgsql-performance-ow...@postgresql.org] on behalf of Claudio Freire [klaussfre...@gmail.com]
Sent: Friday, 18 July 2014 01:21
To: Chris Ruprecht
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Building multiple indexes on one table.

On Thu, Jul 17, 2014 at 7:47 PM, Chris Ruprecht ch...@cdrbill.com wrote:
 Is there any way that I can build multiple indexes on one table without 
 having to scan the table multiple times? For small tables, that's probably 
 not an issue, but if I have a 500 GB table that I need to create 6 indexes 
 on, I don't want to read that table 6 times.
 Nothing I could find in the manual other than reindex, but that's not 
 helping, since it only rebuilds indexes that are already there and I don't 
 know if that reads the table once or multiple times. If I could create 
 indexes inactive and then run reindex, which then reads the table once, I 
 would have a solution. But that doesn't seem to exist either.

Just build them with separate but concurrent connections, and the
scans will be synchronized so it will be only one.

Btw, reindex rebuilds one index at a time, so what I do is issue
separate reindex for each index in parallel, to avoid the repeated
scans as well.

Just make sure you've got the I/O and CPU capacity for it (you'll be
writing many indexes at once, so there is a lot of I/O).

Index creation on large tables is mostly CPU bound as long as no swap occurs.
I/O may be an issue when all your indexes are similar, e.g. all on single int4
columns.
In other cases the writes will not all take place concurrently.
To reduce I/O due to swap, you can consider increasing maintenance_work_mem on
the connections/sessions
that build the indexes.
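
For instance, each connection could run something like (illustrative names):

SET maintenance_work_mem = '1GB';   -- session-local, adjust to the available RAM
CREATE INDEX big_table_col1_idx ON big_table (col1);

with one such session per index, started at about the same time so that the heap
scans get synchronized.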

regards,

Marc Mamin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Cursor + upsert (astronomical data)

2014-07-27 Thread Marc Mamin
[Craig]
If you haven't looked at clustering algorithms yet, you might want to do so.
Your problem is a special case of clustering, where you have a large number
of small clusters.  A good place to start is the overview on Wikipedia:
http://en.wikipedia.org/wiki/Cluster_analysis

According to this list, your method is similar to 
http://en.wikipedia.org/wiki/Basic_sequential_algorithmic_scheme,
but with what seems to be some logical errors.

The simplest approach I could think of is that I process each row of the 3e8 
rows
sequentially and ask:
Do I have an identifier in the radius of 1 arcsec?
No: Generate one and assign me to it.
Yes: Update it and assigne me to it. The update is done as weighted average – 
I keep the
number of how many observations the identifier has been computed. The result 
is that the
identifier will have average coordinates of all the observations it identifies 
– it will
be the center.


Let's say you have 2 single measures on a line at arcsec 1 and 2.1, which hence
correspond to 2 ipix_cat.
Now add a new measure at 1.9: as you choose any of the possible adjacent
ipix_cat without considering the least distance, you may end up with an ipix_cat at
1.45 which is less than 1 arcsec away from the next one.
   Moreover you have raised the weight of both ipix_cat,
 which increases the lock probability when trying a multithreaded update.

The max distance between 2 observations belonging to the same ipix_cat tends to 2
arcsec with your method. If this is ok, you should probably modify your method
so that the 2 first points of my example would have merged to a single
ipix_cat. You could use your weight for this: increase your search radius to
2 arcsec and then reject the candidates located between 1 and 2 arcsec depending
on their weight. The additional work load might be compensated by the smaller
number of ipix_cat that you will have.







Re: [PERFORM] query a table with lots of coulmns

2014-09-20 Thread Marc Mamin
At first, thanks for your fast and comprehensive help.

The structure of my cache table is

a text , b text NOT NULL , c text , d text , e timestamp without
timezone DEFAULT now(),  f text, s1 integer DEFAULT 0, s2 integer
DEFAULT 0, s3 integer DEFAULT 0, ... ,s512 DEFAULT 0


additional constraints: primary key (b)  , Unique(b), Unique(a)
Indexes : Index on a, Index on b

This looks redundant. e.g. you don't need a unique index on b if you already 
have a primary key on it.
Can you post the complete table definition ?

...

One remark which might help: overall 90 - 95 % of the s1-s512 columns
are 0. I am only interested in columns not equals 0. Perhaps it would
make sense to use and array of json and enumerate only values not equals 0.

Could you change that to replace 0 values with NULLs? 
This would greatly reduce your table space, as Postgres is very efficient about
NULL storage:
it marks all null values in a bitmap within the row header, so you just need
about one bit per null
instead of 4 bytes for zeros, and hence get rid of your I/O issue.
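
For example (sketch; repeat or generate the pattern for all s* columns):

ALTER TABLE cachetable ALTER COLUMN s1 SET DEFAULT NULL;
ALTER TABLE cachetable ALTER COLUMN s2 SET DEFAULT NULL;

UPDATE cachetable
   SET s1 = NULLIF(s1, 0),
       s2 = NULLIF(s2, 0)
 WHERE s1 = 0 OR s2 = 0;

followed by a VACUUM FULL or CLUSTER to reclaim the space, as the updates themselves
first create new row versions.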

regards,

Marc Mamin

From: pgsql-performance-ow...@postgresql.org
[pgsql-performance-ow...@postgresql.org] on behalf of Björn Wittich [bjoern_witt...@gmx.de]
Sent: Saturday, 20 September 2014 09:19
To: Josh Berkus; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] query a table with lots of coulmns

At first, thanks for your fast and comprehensive help.

The structure of my cache table is

a text , b text NOT NULL , c text , d text , e timestamp without
timezone DEFAULT now(),  f text, s1 integer DEFAULT 0, s2 integer
DEFAULT 0, s3 integer DEFAULT 0, ... ,s512 DEFAULT 0

additional constraints: primary key (b)  , Unique(b), Unique(a)
Indexes : Index on a, Index on b

This table has 30 Mio rows ( will increase to 50 Mio) in future

My working table is

b text, g integer

Indexes on b and c


This table has 5 Mio rows

Scenario:

What I want to achieve :

SELECT s1,s2,s3,...s512,g,d  from worktable INNER JOIN cachetable
USING(b) ORDER BY g


The inner join will match at least 95 % of columns of the smaller
worktable in this example 4,75 mio rows.

Running this query takes several hours until I receive the first
results. Query analyzing shows that the execution plan is doing 2 seq
table scans on cache and work table.


When I divide this huge statement into

SELECT s1,s2,s3,...s512,g,d  from worktable INNER JOIN cachetable
USING(b) WHERE g BETWEEN 1 and 1 ORDER BY g, SELECT
s1,s2,s3,...s512,g,d  from worktable INNER JOIN cachetable USING(b)
WHERE g BETWEEN 10001 and 2 ORDER BY g, 

(I can do this because g i unique and continous id from 1 to N)

The result is fast but fireing parallel requests (4-8 times parallel)
slows down the retrieval.

Execution plan changes when adding BETWEEN 1 and 1 to use the indexes.



One remark which might help: overall 90 - 95 % of the s1-s512 columns
are 0. I am only interested in columns not equals 0. Perhaps it would
make sense to use and array of json and enumerate only values not equals 0.

Statistics on the large table:
table size: 80 GB
toast-tablesize: 37 GB
size of indexes: 17 GB


Thanks for your help and ideas

Björn





Am 19.09.2014 23:40, schrieb Josh Berkus:
 On 09/19/2014 04:51 AM, Björn Wittich wrote:
 I am relatively new to postgres. I have a table with 500 coulmns and
 about 40 mio rows. I call this cache table where one column is a unique
 key (indexed) and the 499 columns (type integer) are some values
 belonging to this key.

 Now I have a second (temporary) table (only 2 columns one is the key of
 my cache table) and I want  do an inner join between my temporary table
 and the large cache table and export all matching rows. I found out,
 that the performance increases when I limit the join to lots of small
 parts.
 But it seems that the databases needs a lot of disk io to gather all 499
 data columns.
 Is there a possibilty to tell the databases that all these colums are
 always treated as tuples and I always want to get the whole row? Perhaps
 the disk oraganization could then be optimized?
 PostgreSQL is already a row store, which means by default you're getting
 all of the columns, and the columns are stored physically adjacent to
 each other.

 If requesting only 1 or two columns is faster than requesting all of
 them, that's pretty much certainly due to transmission time, not disk
 IO.  Otherwise, please post your schema (well, a truncated version) and
 your queries.

 BTW, in cases like yours I've used a INT array instead of 500 columns to
 good effect; it works slightly better with PostgreSQL's compression.




--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org

Re: [PERFORM] Why is GIN index slowing down my query?

2015-02-02 Thread Marc Mamin
AlexK987 alex.cue@gmail.com writes:
 I've created a GIN index on an INT[] column, but it slows down the selects.
 Here is my table:

 create table talent(person_id INT NOT NULL,
 skills INT[] NOT NULL);

 insert into talent(person_id, skills)
 select generate_series, array[0, 1] || generate_series
 from generate_series(3, 1048575);

 create index talent_skills on talent using gin(skills);

 analyze talent;

 Here is my select:

 explain analyze 
 select * from talent 
 where skills @> array[1, 15]

Well, that's pretty much going to suck given that data distribution.
Since 1 is a member of every last entry, the GIN scan will end up
examining every entry, and then rejecting all of them as not being
true subsets of [1,15].  

This is equivalent and fast:

explain analyze
WITH rare AS (
 select * from talent
 where skills @> array[15])
select * from rare
 where skills @> array[1]
 -- (with changed operator)

You might vary your query according to an additional table that keeps the
occurrence count of all skills.
Not really pretty though.
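
Such an occurrence-count table could be built along these lines (untested sketch):

CREATE TABLE skill_count AS
  SELECT unnest(skills) AS skill, count(*) AS cnt
  FROM talent
  GROUP BY 1;

CREATE UNIQUE INDEX skill_count_ux ON skill_count (skill);

The application would then put the rarest skill(s) into the CTE and test the remaining
ones in the outer query.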

I wonder if in such cases, the Bitmap Index Scan could discard entries that 
would result in a table scan
and use them only in the recheck part:

explain
 select * from talent 
 where skills @> array[1]
 
 Seq Scan on talent  (cost=0.00..21846.16 rows=1048573 width=37)
   Filter: (skills @> '{1}'::integer[])


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



Re: [ADMIN] [PERFORM] empty string Vs NULL

2015-02-09 Thread Marc Mamin

Hi

2015-02-09 12:22 GMT+01:00 sridhar bamandlapally sridhar@gmail.com:

Hi All

We are testing our Oracle compatible business applications on PostgreSQL 
 database,

the issue we are facing is empty string Vs NULL

In Oracle '' (empty string) and NULL are treated as NULL

but, in PostgreSQL '' empty string not treated as NULL

I need some implicit way in PostgreSQL where ''empty string can be 
 treated as NULL

It is not possible in PostgreSQL.  PostgreSQL respects ANSI SQL standard - 
Oracle not.

Regards

Pavel

p.s. theoretically you can overwrite a type's operators to support the Oracle
behaviour, but you cannot be sure about unexpected negative side effects.


A clean way would be to disallow empty strings on the PG side.
This is somewhat cumbersome depending on how dynamic your model is
and adds some load on your db though.


ALTER TABLE tablename ADD CONSTRAINT tablename_not_empty_ck
  CHECK (false= (colname1 IS NULL OR colname2 IS NULL OR colname3 IS NULL ...) 
IS NULL)

-- and to ensure compatibility with your app or migration:

CREATE OR REPLACE FUNCTION tablename_setnull_trf()
  RETURNS trigger AS
$BODY$
BEGIN
-- for all *string* columns
   NEW.colname1 = NULLIF (NEW.colname1,'');
   NEW.colname2 = NULLIF (NEW.colname2,'');
   NEW.colname3 = NULLIF (NEW.colname3,'');
RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;

CREATE TRIGGER tablename_setnull_tr
  BEFORE INSERT OR UPDATE
  ON tablename
  FOR EACH ROW
  EXECUTE PROCEDURE tablename_setnull_trf();

You can query the pg catalog to generate all required statements.
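e.g. a sketch to generate the assignment lines for all string columns of a given table
(illustrative, adjust the type list as needed):

SELECT string_agg(format('NEW.%I = NULLIF(NEW.%I, %L);', attname, attname, ''), E'\n')
FROM pg_attribute
WHERE attrelid = 'tablename'::regclass
  AND attnum > 0 AND NOT attisdropped
  AND atttypid IN ('text'::regtype, 'varchar'::regtype, 'bpchar'::regtype);
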
A possible issue is the order in which triggers are fired, when more than one 
exist for a given table:
If more than one trigger is defined for the same event on the same relation, 
the triggers will be fired in alphabetical order by trigger name
( http://www.postgresql.org/docs/9.3/static/trigger-definition.html )

regards,

Marc Mamin


Re: [PERFORM] Performance issues

2015-03-16 Thread Marc Mamin

Hi Team,

This is the EXPLAIN ANALYZE for one of the view : S_V_D_CAMPAIGN_HIERARCHY:


Rows Removed by Join Filter: 3577676116

That's quite a lot.
You're possibly missing a clause in a join, resulting in a cross join.
It is also helpful to put your result here:
http://explain.depesz.com/
regards,

Marc Mamin



===


Nested Loop  (cost=33666.96..37971.39 rows=1 width=894) (actual time=443.556..966558.767 rows=45360 loops=1)
   Join Filter: (tp_exec.touchpoint_execution_id = valid_executions.touchpoint_execution_id)
   Rows Removed by Join Filter: 3577676116
   CTE valid_executions
     ->  Hash Join  (cost=13753.53..31711.17 rows=1 width=8) (actual time=232.571..357.749 rows=52997 loops=1)
           Hash Cond: ((s_f_touchpoint_execution_status_history_1.touchpoint_execution_id = s_f_touchpoint_execution_status_history.touchpoint_execution_id) AND ((max(s_f_touchpoint_execution_status_history_1.creation_dt)) = s_f_touchpoint_execution_status_history.creation_dt))
           ->  HashAggregate  (cost=6221.56..6905.66 rows=68410 width=16) (actual time=139.713..171.340 rows=76454 loops=1)
                 ->  Seq Scan on s_f_touchpoint_execution_status_history s_f_touchpoint_execution_status_history_1  (cost=0.00..4766.04 rows=291104 width=16) (actual time=0.006..38.582 rows=291104 loops=1)
           ->  Hash  (cost=5493.80..5493.80 rows=135878 width=16) (actual time=92.737..92.737 rows=136280 loops=1)
                 Buckets: 16384  Batches: 1  Memory Usage: 6389kB
                 ->  Seq Scan on s_f_touchpoint_execution_status_history  (cost=0.00..5493.80 rows=135878 width=16) (actual time=0.012..55.078 rows=136280 loops=1)
                       Filter: (touchpoint_execution_status_type_id = ANY ('{3,4}'::integer[]))
                       Rows Removed by Filter: 154824
   ->  Nested Loop Left Join  (cost=1955.80..6260.19 rows=1 width=894) (actual time=31.608..3147.015 rows=67508 loops=1)
         ->  Nested Loop  (cost=1955.67..6260.04 rows=1 width=776) (actual time=31.602..2912.625 rows=67508 loops=1)
               ->  Nested Loop Left Join  (cost=1955.54..6259.87 rows=1 width=658) (actual time=31.595..2713.696 rows=72427 loops=1)
                     ->  Nested Loop Left Join  (cost=1955.40..6259.71 rows=1 width=340) (actual time=31.589..2532.926 rows=72427 loops=1)
                           ->  Nested Loop Left Join  (cost=1955.27..6259.55 rows=1 width=222) (actual time=31.581..2354.662 rows=72427 loops=1)
                                 ->  Nested Loop  (cost=1954.99..6259.24 rows=1 width=197) (actual time=31.572..2090.104 rows=72427 loops=1)
                                       ->  Nested Loop  (cost=1954.71..6258.92 rows=1 width=173) (actual time=31.562..1802.857 rows=72427 loops=1)
                                             Join Filter: (camp_exec.campaign_id = wave.campaign_id)
                                             Rows Removed by Join Filter: 243
                                             ->  Nested Loop  (cost=1954.42..6254.67 rows=13 width=167) (actual time=31.551..1468.718 rows=72670 loops=1)
                                                   ->  Hash Join  (cost=1954.13..6249.67 rows=13 width=108) (actual time=31.525..402.039 rows=72670 loops=1)
                                                         Hash Cond: ((tp_exec.touchpoint_id = tp.touchpoint_id) AND (wave_exec.wave_id = tp.wave_id))
                                                         ->  Hash Join  (cost=1576.83..4595.51 rows=72956 width=90) (actual time=26.254..256.328 rows=72956 loops=1)
                                                               Hash Cond: (tp_exec.wave_execution_id = wave_exec.wave_execution_id)
                                                               ->  Seq Scan on s_d_touchpoint_execution tp_exec  (cost=0.00..1559.56 rows=72956 width=42) (actual time=0.005..76.099 rows=72956 loops=1)
                                                               ->  Hash  (cost=1001.37..1001.37 rows=46037 width=56) (actual time=26.178..26.178 rows=46037 loops=1)
                                                                     Buckets: 8192  Batches: 1  Memory Usage: 4104kB
                                                                     ->  Seq Scan on s_d_wave_execution wave_exec  (cost=0.00..1001.37 rows=46037 width=56) (actual time=0.006..10.388 rows=46037 loops=1)
                                                         ->  Hash  (cost=212.72..212.72 rows=10972 width=26) (actual time=5.252..5.252 rows=10972 loops=1)
                                                               Buckets: 2048  Batches: 1  Memory Usage: 645kB
                                                               ->  Seq Scan on s_d_touchpoint tp  (cost=0.00..212.72 rows=10972 width=26) (actual time=0.012..2.319 rows=10972 loops=1)
                                                   ->  Index Scan using s_d_campaign_execution_idx

Re: [PERFORM] EXPLAIN (no ANALYZE) taking an hour for INSERT FROM SELECT

2015-03-05 Thread Marc Mamin


Hi, thanks for your follow-up questions.
- postgres version is 9.1.13
- the number of rows (in this latest instance) is 28,474,842
- I've clustered and vacuum-full-ed and analyzed this table frequently, 
attempting to troubleshoot this. (Running vacuum full on the whole catalog 
seems a little excessive, and unlikely to help.)

Hi,

I mean the pg_* tables. When working with temp objects and a high number of
classes, regular vacuum may not clean them efficiently.
It is not a bad idea to run a manual VACUUM FULL VERBOSE on the largest of
those from time to time to verify that they don't grow out of control;
this normally takes only a few seconds.
The verbose output of VACUUM FULL sometimes returns interesting information...
For the ANALYZE performance, I guess that these are the most relevant ones:
 pg_statistic;
 pg_class;
 pg_attribute;
 pg_index;
 pg_constraint;
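
For example (a sketch; VACUUM FULL takes an exclusive lock and needs superuser
for the catalogs, so run it in a quiet moment):

VACUUM FULL VERBOSE pg_catalog.pg_statistic;
VACUUM FULL VERBOSE pg_catalog.pg_class;
VACUUM FULL VERBOSE pg_catalog.pg_attribute;

-- and a quick look at the largest catalog relations:
SELECT c.relname, pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE n.nspname = 'pg_catalog'
 ORDER BY pg_total_relation_size(c.oid) DESC
 LIMIT 10;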

regards,

Marc Mamin



- no other processes are likely to be interfering; nothing other than 
PostgreSQL runs on this machine (except for normal OS processes and New Relic 
server monitoring service); concurrent activity in PostgreSQL is low-level and 
unrelated, and this effect is observed systematically whenever this kind of 
operation is performed on this table
- no override for this table; the system default_statistics_target is 100 (the 
default)
- yes, an ANALYZE is performed on the temp table after the COPY and before the 
INSERT
- no index on the temp table (but I'm scanning the whole thing anyway). There 
are indexes on f_foo as detailed in my original post, and I expect the PK to 
make the WHERE NOT EXISTS filtering efficient (as it filters on exactly all 
columns of the PK) ... but even if it didn't, I would expect that to only slow 
down the actual insert execution, not the EXPLAIN.
Cheers,
Gulli
On Wed, Mar 4, 2015 at 8:10 PM, Marc Mamin m.ma...@intershop.de wrote:

Hi,
we are seeing cases of EXPLAIN INSERT INTO foo SELECT ... taking over an 
 hour, with disk I/O utilization (percent of time device is busy) at 100% the 
 whole time, although I/O bandwidth is not saturated. This is on PostgreSQL 
 9.1.13.
What could cause this? Note that there is no ANALYZE. Is it possible that 
 the SELECT is actually executed, in planning the INSERT?
When executing the INSERT itself (not EXPLAIN) immediately afterwards, 
 that logs a temporary file message, but the EXPLAIN invocation does not 
 (though the disk I/O suggests that a large on-disk sort is occurring):
LOG:  temporary file: path base/pgsql_tmp/pgsql_tmp6016.0, size 
 744103936
STATEMENT:  INSERT INTO f_foo SELECT
[...]
During that actual execution, there's a lower disk I/O utilization 
 (though a higher MB/s rate).
Charts of disk I/O utilization and rate are at 
 http://postimg.org/image/628h6jmn3/ ... the solid 100% span is the EXPLAIN 
 statement, ending at 6:13:30pm, followed by the INSERT statement ending at 
 6:32pm. Metrics are collected by New Relic; their definition of I/O 
 utilization is at https://discuss.newrelic.com/t/disk-i-o-metrics/2900
Here's the EXPLAIN statement:
LOG:  duration: 3928778.823 ms  statement: EXPLAIN INSERT INTO f_foo 
 SELECT
   t_foo.fk_d1,
   t_foo.fk_d2,
   t_foo.fk_d3,
   t_foo.fk_d4,
   t_foo.fk_d5,
   t_foo.fk_d6,
   t_foo.value
FROM t_foo
WHERE NOT (EXISTS
 (SELECT *
  FROM f_foo
  WHERE f_foo.fk_d2 = t_foo.fk_d2
AND f_foo.fk_d3 = t_foo.fk_d3
AND f_foo.fk_d4 = t_foo.fk_d4
AND f_foo.fk_d5 = t_foo.fk_d5
AND f_foo.fk_d6 = t_foo.fk_d6
AND f_foo.fk_d1 = t_foo.fk_d1))
(where t_foo is a temp table previously populated using COPY, and the NOT 
 EXISTS subquery refers to the same table we are inserting into)
Here's the EXPLAIN output:
Insert on f_foo  (cost=8098210.50..9354519.69 rows=1 width=16)
  ->  Merge Anti Join  (cost=8098210.50..9354519.69 rows=1 width=16)
        Merge Cond: ((t_foo.fk_d2 = public.f_foo.fk_d2) AND
                     (t_foo.fk_d3 = public.f_foo.fk_d3) AND
                     (t_foo.fk_d4 = public.f_foo.fk_d4) AND
                     (t_foo.fk_d5 = public.f_foo.fk_d5) AND
                     (t_foo.fk_d6 = public.f_foo.fk_d6) AND
                     (t_foo.fk_d1 = public.f_foo.fk_d1))
        ->  Sort  (cost=3981372.25..4052850.70 rows=28591380 width=16)
              Sort Key: t_foo.fk_d2, t_foo.fk_d3, t_foo.fk_d4, t_foo.fk_d5, t_foo.fk_d6, t_foo.fk_d1
              ->  Seq Scan on t_foo  (cost=0.00..440461.80 rows=28591380 width=16)
        ->  Sort  (cost=4116838.25..4188025.36 rows=28474842 width=12)
              Sort Key: public.f_foo.fk_d2, public.f_foo.fk_d3, public.f_foo.fk_d4, public.f_foo.fk_d5

Re: [PERFORM] EXPLAIN (no ANALYZE) taking an hour for INSERT FROM SELECT

2015-03-04 Thread Marc Mamin
- analyzed?
- any index on it?

We have about 300'000 entries in our pg_class table, and I've never seen such
an issue.

regards,
Marc Mamin



Re: [PERFORM] Query performance

2015-01-25 Thread Marc Mamin

I have an events table that records page views and purchases (type = 'viewed' 
or type='purchased'). I have a query that figures out people who 
bought/viewed this also bought/viewed that.

It worked fine, taking about 0.1 seconds to complete, until a few hours ago 
when it started taking hours to complete. Vacuum/analyze didn't help.  Turned 
out there was one session_id that had 400k rows in the system. Deleting that 
made the query performant again.

Is there anything I can do to make the query work better in cases like that? 
Missing index, or better query?

This is on 9.3.5.

The below is reproduced at the following URL if it's not formatted correctly 
in the email. 
https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt

Hello,

here are 2 variations that should be somewhat faster.

 It seems you may have duplicates on (site_id, session_id, product_id),
 which would falsify the result. In that case you'll need some more logic in the
query.
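
 A quick check for such duplicates could look like this (a sketch):

 select site_id, session_id, product_id, type, count(*)
   from events
  group by 1, 2, 3, 4
 having count(*) > 1
  limit 10;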

 select
'82503' as product_id,
e2.site_id,
e2.product_id,
count(nullif(e2.type='viewed', false)) view_count,
count(nullif(e2.type='purchased', false)) purchase_count
  from events e1
  join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
  where
e1.product_id = '82503' and
e2.product_id != '82503'
 group by  e2.product_id, e2.site_id;


 OR:

 WITH SALL as(
  select
 e2.site_id,
 e2.product_id,
 count(nullif(e2.type='viewed', false)) view_count,
 count(nullif(e2.type='purchased', false)) purchase_count
   from events e1
   join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
   where
 e1.product_id = '82503'
  group by  e2.product_id, e2.site_id
 )
 SELECT
'82503' as product_id_1,
site_id,
product_id,
view_count,
purchase_count
 FROM SALL
 WHERE product_id != '82503';


 regards,
 Marc Mamin



explain  select
   e1.product_id,
   e2.site_id,
   e2.product_id,
   count(nullif(e2.type='viewed', false)) view_count,
   count(nullif(e2.type='purchased', false)) purchase_count
 from events e1
 join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
 where
   e1.product_id = '82503' and
   e1.product_id != e2.product_id
 group by e1.product_id, e2.product_id, e2.site_id;
 QUERY PLAN
------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
   ->  Sort  (cost=828395.67..840117.89 rows=465 width=19)
         Sort Key: e1.product_id, e2.product_id, e2.site_id
         ->  Nested Loop  (cost=11.85..20371.14 rows=465 width=19)
               ->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 rows=369 width=49)
                     Recheck Cond: (product_id = '82503'::citext)
                     ->  Bitmap Index Scan on events_product_id_site_id_idx  (cost=0.00..11.20 rows=369 width=0)
                           Index Cond: (product_id = '82503'::citext)
               ->  Index Scan using events_session_id_type_product_id_idx on events e2  (cost=0.56..51.28 rows=12 width=51)
                     Index Cond: ((session_id = e1.session_id) AND (type = e1.type))
                     Filter: (e1.product_id <> product_id)
(11 rows)

recommender_production= \d events
                                   Table "public.events"
   Column    |           Type           |                     Modifiers
-------------+--------------------------+-----------------------------------------------------
 id          | bigint                   | not null default nextval('events_id_seq'::regclass)
 user_id     | citext                   |
 session_id  | citext                   | not null
 product_id  | citext                   | not null
 site_id     | citext                   | not null
 type        | text                     | not null
 happened_at | timestamp with time zone | not null
 created_at  | timestamp with time zone | not null
Indexes:
    "events_pkey" PRIMARY KEY, btree (id)
    "events_product_id_site_id_idx" btree (product_id, site_id)
    "events_session_id_type_product_id_idx" btree (session_id, type, product_id)
Check constraints:
    "events_session_id_check" CHECK (length(session_id::text) < 255)
    "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 'viewed'::text]))
    "events_user_id_check" CHECK (length(user_id::text) < 255)






[PERFORM] Re: Hmmm... why does pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?

2015-07-03 Thread Marc Mamin


Hi everyone,

I've written a new open source tool for easily parallelising SQL scripts in 
postgres.   [obligatory plug:   https://github.com/gbb/par_psql   ]

Using it, I'm seeing a problem that I've also seen in other postgres projects 
involving high degrees of parallelisation in the last 12 months.

Basically:

- I have machines here with up to 16 CPU cores and 128GB memory, very fast 
SSDs and controller etc, carefully configured kernel/postgresql.conf for high 
performance.

- Ordinary queries parallelise nearly perfectly (e.g. SELECT some_stuff ...), 
e.g. almost up to 16x performance improvement.

- Non-DB stuff like GDAL, python etc. parallelise nearly perfectly.

- HOWEVER calls to CPU-intensive user-defined pl/pgsql functions (e.g. SELECT 
myfunction(some_stuff)) do not parallelise well, even when they are 
independently defined functions, or accessing tables in a read-only way. They 
hit a limit of 2.5x performance improvement relative to single-CPU performance 
(pg9.4) and merely 2x performance (pg9.3) regardless of how many CPU cores I 
throw at them. This is about 6 times slower than I'm expecting.


I can't see what would be locking. It seems like it's the pl/pgsql environment 
itself that is somehow locking or incurring some huge frictional costs. 
Whether I use independently defined functions, independent source tables, 
independent output tables, makes no difference whatsoever, so it doesn't feel 
'lock-related'. It also doesn't seem to be WAL/synchronisation related, as the 
machines I'm using can hit absurdly high pgbench rates, and I'm using unlogged 
tables for output.

Take a quick peek here: 
https://github.com/gbb/par_psql/blob/master/BENCHMARKS.md

I'm wondering what I'm missing here. Any ideas?

Graeme.


auto_explain might help give some insight into what's going on: 
http://www.postgresql.org/docs/9.4/static/auto-explain.html
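
For example (a sketch; LOAD per session needs superuser, otherwise add the module
to shared_preload_libraries; log_analyze adds overhead, use with care in production):

LOAD 'auto_explain';
SET auto_explain.log_min_duration = '250ms';
SET auto_explain.log_analyze = on;
-- important here: also log statements executed inside pl/pgsql functions
SET auto_explain.log_nested_statements = on;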

Regards, 
Marc Mamin



Re: [PERFORM] GroupAggregate and Integer Arrays

2015-10-24 Thread Marc Mamin
> CREATE OR REPLACE FUNCTION array_add(int[],int[]) RETURNS int[] AS $$
>-- Add two arrays.
>select
>   ARRAY (
>  SELECT coalesce($1[i],0) + coalesce($2[i],0)
>  FROM (
> select generate_series(least(array_lower($1, 1),array_lower($2,
> 1)), greatest(array_upper($1, 1),array_upper($2, 1)), 1) AS i
>  ) sub
>GROUP BY i
>ORDER BY i
>);
> $$ LANGUAGE sql STRICT IMMUTABLE;

it seems that both the GROUP BY and the ORDER BY are superfluous and add some
cycles.
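
A simplified version might look like this (a sketch; generate_series() returns the
indexes in ascending order, so the ARRAY() subquery keeps that order):

CREATE OR REPLACE FUNCTION array_add(int[],int[]) RETURNS int[] AS $$
   select ARRAY (
      SELECT coalesce($1[i],0) + coalesce($2[i],0)
      FROM generate_series(least(array_lower($1, 1), array_lower($2, 1)),
                           greatest(array_upper($1, 1), array_upper($2, 1))) AS i
   );
$$ LANGUAGE sql STRICT IMMUTABLE;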

regards,

Marc Mamin



Re: [PERFORM] wildcard text filter switched to boolean column, performance is way worse

2015-07-08 Thread Marc Mamin
Hello,
 
 
 
 
 
 From: pgsql-performance-ow...@postgresql.org 
 [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Mike Broers
 Sent: Dienstag, 7. Juli 2015 18:28
 To: Tom Lane
 Cc: pgsql-performance@postgresql.org
 Subject: Re: [PERFORM] wildcard text filter switched to boolean column, 
 performance is way worse
 
 After bumping up work_mem from 12MB to 25MB that last materialize is indeed 
 hashing and this cut the query time by about 60%.  Thanks, this was very 
 helpful and gives me something else to look for when troubleshooting 
 explains.  
 
 
 
 On Tue, Jul 7, 2015 at 11:10 AM, Mike Broers mbro...@gmail.com wrote:
 Thanks, very informative! I'll experiment with work_mem settings and report 
 back.
 
 On Tue, Jul 7, 2015 at 11:02 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Mike Broers mbro...@gmail.com writes:
  I had a query that was filtering with a wildcard search of a text field for
  %SUCCESS%. The query took about 5 seconds and was running often so I wanted
  to improve it.  I suggested that the engineers include a new boolean column
  for successful status.  They implemented the requested field, but the query
  that filters on that new column runs very long (i kill it after letting it
  run for about an hour).  Can someone help me understand why that is the
  case and how to resolve it?
 
 It's hashing the subplan output in the first case and not the second:
 
  Seq Scan on lead  (cost=130951.81..158059.21 rows=139957 width=369) (actual
  time=4699.619..4699.869 rows=1 loops=1)
  Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text <>
  ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3

 vs
 
Seq Scan on lead  (cost=85775.78..9005687281.12 rows=139957 width=369)
  Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text <>
  ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3
^
 
 Presumably, the new more-accurate rows count causes the planner to realize
 that the hash table will exceed work_mem so it doesn't choose to hash ...
 but for your situation, you'd rather it did, because what you're getting
 instead is a Materialize node that spills to disk (again, because the data
 involved exceeds work_mem) and that's a killer for this query.  You should
 be able to get back the old behavior if you raise work_mem enough.
 
 Another idea you might think about is changing the OR'd IN conditions
 to a single IN over a UNION ALL of the subselects.  I'm not really sure if
 that would produce a better plan, but it's worth trying if it wouldn't
 require too much app-side contortion.


Hello,
you might try to use a CTE to first collect the IDs to exclude, and join them 
to your main table.
This should result in an anti join pattern.

Something like:

WITH IDS as (
SELECT U1.lead_id AS lead_id 
FROM event U1 
WHERE U1.event_type ='type_1'
UNION (
   SELECT U1.lead_id AS lead_id 
   FROM event U1 
   WHERE U1.event_type = 'type_2'
   INTERSECT
   SELECT U1.lead_id AS lead_id 
   FROM event U1 
   WHERE successful
)
)
SELECT * FROM lead LEFT OUTER JOIN IDS ON (lead.id=IDS.lead_id)
WHERE IDS.lead_id IS NULL;
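
The same anti join can also be expressed with NOT EXISTS, which the planner
likewise turns into an anti join (a sketch, keeping the IDS CTE from above):

SELECT l.*
FROM lead l
WHERE NOT EXISTS (SELECT 1 FROM IDS WHERE IDS.lead_id = l.id);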

regards,

Marc Mamin






 regards, tom lane
 
 




Re: [PERFORM] Slow query help

2016-01-07 Thread Marc Mamin


>I ask your help to solve a slow query which is taking more than 14 seconds to 
>be executed.
>Maybe I am asking too much both from you and especially from postgresql, as it 
>is really huge, involving 16 tables.
>
>Explain:
>http://explain.depesz.com/s/XII9
>
>Schema:
>http://adj.com.br/erp/data_schema/

Hello,

It seems that you don't pay much attention to column alignment.
e.g. http://adj.com.br/erp/data_schema/tables/ERP_PUBLIC_sys_person.html

This probably won't make any significant difference in your case,
but this is something to be aware of when dealing with large tables.
Here is a good starting link for this topic:
http://stackoverflow.com/questions/12604744/does-the-order-of-columns-in-a-postgres-table-impact-performance
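
To illustrate the idea only (hypothetical table, not your schema): fixed-width
columns ordered by descending alignment avoid padding bytes, e.g. a boolean placed
directly before a bigint costs 7 padding bytes per row:

-- wastes padding:
CREATE TABLE person_padded (is_active boolean, id bigint, age integer, name text);
-- packs better: 8-byte, then 4-byte, then 1-byte types, variable-length last:
CREATE TABLE person_packed (id bigint, age integer, is_active boolean, name text);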

regards,

Marc Mamin


Re: [PERFORM] gin performance issue.

2016-02-08 Thread Marc Mamin


> -Original Message-
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> Sent: Freitag, 5. Februar 2016 16:07
 

> > http://explain.depesz.com/s/wKv7
> > Postgres Version 9.3.10 (Linux)
> > 
> > Hello,
> > this is a large daily table that only get bulk inserts (200-400 /days) with 
> > no update.
> > After rebuilding the whole table, the Bitmap Index Scan on
> > r_20160204_ix_toprid falls under 1 second (from 800)
> >
> > Fastupdate is using the default, but autovacuum is disabled on that
> > table which contains 30 Mio rows.


> Pre-9.5, it's a pretty bad idea to disable autovacuum on a GIN index,
> because then the "pending list" only gets flushed when it exceeds
> work_mem.  (Obviously, using a large work_mem setting makes this
> worse.)
> 
>   regards, tom lane


Hello,
knowing what the problem is doesn't really help much here:

- auto vacuum will not run as these are insert only tables
- according to this post, auto analyze would also do the job:
  http://postgresql.nabble.com/Performance-problem-with-gin-index-td5867870.html
  It seems that this information is missing in the doc
  
  but it sadly doesn't trigger in our case either, as we run manual analyzes 
during the data processing just after the imports.
  Manual vacuum is just too expensive here.
  
  Hence disabling fast update seems to be our only option. 
  
  I hope this problem will help push up the 9.5 upgrade on our todo list :)
  
  Ideally, we would then like to flush the pending list unconditionally after 
the imports. 
  I guess we could achieve something approaching that by modifying the analyze 
scale factor and gin_pending_list_limit
  before/after the (bulk) imports, but having the possibility to flush it via 
SQL would be better. 
  Is this a reasonable feature wish?
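  
  For reference, a sketch of the options (if I recall correctly, 
gin_clean_pending_list() only arrived with 9.6):
  
  -- possible today: route new entries straight into the main tree
  -- (note: this does not flush entries already sitting in the pending list)
  ALTER INDEX r_20160204_ix_toprid SET (fastupdate = off);
  
  -- 9.6+: flush the pending list on demand, e.g. right after the bulk imports
  SELECT gin_clean_pending_list('r_20160204_ix_toprid'::regclass);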
  
  And a last question: how does the index update work with bulk (COPY) inserts?
  Without a pending list: is it like a per-row trigger, or will the index be taken 
care of afterwards?
  With small pending lists: is there a concurrency problem, or can both tasks 
cleanly work in parallel?
  
  best regards,
  
  Marc mamin
  
  




[PERFORM] gin performance issue.

2016-02-05 Thread Marc Mamin
http://explain.depesz.com/s/wKv7
Postgres Version 9.3.10 (Linux)

Hello,
this is a large daily table that only gets bulk inserts (200-400 per day) with no 
updates.
After rebuilding the whole table, the Bitmap Index Scan on r_20160204_ix_toprid 
falls under 1 second (from 800).

Fastupdate is using the default, but autovacuum is disabled on that table, which 
contains 30 Mio rows.
Another specificity is that the cardinality of the indexed column is very high: 
the average count per distinct value is only 2.7.

I'm not sure what the problem is. Does the missing vacuum affect the gin index 
sanity beyond not cleaning the pending list?
As I understand it, this list will be merged into the index automatically when 
it gets full, independently of the vacuum setting.

Can it be an index bloating issue ?

and last but not least, can I reduce the problem by configuration ?

regards,

Marc Mamin



Re: [PERFORM] Inlining of functions (doing LIKE on an array)

2016-11-11 Thread Marc Mamin



> From: pgsql-performance-ow...@postgresql.org 
> [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of 
> l...@laurent-hasson.com
> Sent: Freitag, 11. November 2016 07:54
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] Inlining of functions (doing LIKE on an array)
> 
> Hello,
> 
> I am trying to implement an efficient "like" over a text[]. I see a lot of 
> people have tried before me and I learnt a lot through the forums. The 
> results of my search is that a query like the following is optimal:
> 
> select count(*) 
>   from claims
> where (select count(*) 
>   from unnest("ICD9_DGNS_CD") x_ 
>  where x_ like '427%'
>) > 0
> 

Hi,
are you using GIN indexes?

http://stackoverflow.com/questions/4058731/can-postgresql-index-array-columns 
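
For example (a sketch, assuming ICD9_DGNS_CD is a text[] and '42731' is just a made-up code):

CREATE INDEX claims_icd9_gin ON claims USING gin ("ICD9_DGNS_CD");

-- the GIN array opclass accelerates containment/overlap searches, e.g.:
SELECT count(*) FROM claims WHERE "ICD9_DGNS_CD" @> ARRAY['42731'];
-- but note it cannot speed up a LIKE '427%' prefix match on the elements themselves.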

moreover your query can still be optimized:
=>
select count(*)
  from claims
 where exists (select *
                 from unnest("ICD9_DGNS_CD") x_
                where x_ like '427%'
              )

regards,

Marc Mamin

> So I figured I'd create a Function to encapsulate the concept:
> 
> CREATE OR REPLACE FUNCTION ArrayLike(text[], text)
> RETURNS bigint
> AS 'select count(*) from unnest($1) a where a like $2'
> LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF
> 
> This works functionally, but performs like crap: full table scan, and cannot 
> make use of any index it seems. Basically, it feels like PG can't inline that 
> function.
> 
> I have been trying all evening to find a way to rewrite it to trick the 
> compiler/planner into inlining. I tried the operator approach for example, 
> but performance is again not good.
> 
> create function rlike(text,text) 
> returns bool as 'select $2 like $1' language sql strict immutable;
> create operator  ``` (procedure = rlike, leftarg = text, 
>   rightarg = text, commutator = ```);
> CREATE OR REPLACE FUNCTION MyLike(text[], text)
> RETURNS boolean
> AS 'select $2 ``` ANY($1)'
> LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF
> 
> And by not good, I mean that on my table of 2M+ rows, the "native" query 
> takes 3s, while the function version takes 9s and the operator version takes 
> (via the function, or through the operator directly), takes 15s.
> 
> Any ideas or pointers?
> 
> 
> Thank you,
> Laurent Hasson

