[PERFORM] Optimizer : query rewrite and execution plan ?

2008-02-06 Thread SURANTYN Jean François
Hi
 
I have discovered an issue on my recently installed PostgreSQL database: it 
seems that the optimizer cannot, when possible, simplify and rewrite a simple 
query before running it. Here is a simple and reproducible example:
 
my_db=# create table test (n numeric);
CREATE
my_db=# insert into test values (1); -- run 10 times
INSERT
my_db=# insert into test values (0); -- run 10 times
INSERT
my_db=# select count(*) from test;
 count
-------
    20
(1 row)
my_db=# vacuum full analyze test;
VACUUM
my_db=# explain select * from test where n = 1;
                      QUERY PLAN
------------------------------------------------------
 Seq Scan on test  (cost=0.00..1.25 rows=10 width=9)
   Filter: (n = 1::numeric)
(2 rows)
 
my_db=# explain select * from test where n = 1 and n = 1;
                      QUERY PLAN
------------------------------------------------------
 Seq Scan on test  (cost=0.00..1.30 rows=5 width=9)
   Filter: ((n = 1::numeric) AND (n = 1::numeric))
(2 rows)
 
In the first SELECT query (with WHERE n=1), the estimated number of returned 
rows is correct (10), whereas in the second SELECT query (with WHERE n=1 AND 
n=1), the estimated number of returned rows is 5 (instead of 10!).
So the optimizer has under-estimated the number of rows returned.
That issue is very annoying because, with generated SQL queries (from Business 
Objects for example) on big tables, it is possible that some queries contain 
the same WHERE condition several times (WHERE n=1 AND n=1 for example), and as 
the optimizer under-estimates the number of returned rows, bad execution plans 
can be chosen (nested loops instead of hash joins, for example)
 
Is the estimated number of returned rows directly linked to the optimizer's 
decision to choose hash joins or nested loops in join queries?
Is there a way for the PostgreSQL optimizer to simplify and rewrite the SQL 
statements before running them? Are there some parameters that could change 
the execution plans?
 
Thanks in advance for your help
 
Jean-Francois SURANTYN
 




Re: [PERFORM] Optimizer : query rewrite and execution plan ?

2008-02-06 Thread Richard Huxton
SURANTYN Jean François wrote:
 my_db=# explain select * from test where n = 1;

 my_db=# explain select * from test where n = 1 and n = 1;

 In the first SELECT query (with where n=1), the estimated number of
 returned rows is correct (10), whereas in the second SELECT query
 (with where n=1 and n=1), the estimated number of returned rows is
 5 (instead of 10 !) So the optimizer has under-estimated the number
 of rows returned

That's because it's a badly composed query. The planner is guessing how
much overlap there would be between the two clauses. It's not exploring
the option that they are the same clause repeated.
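
For what it's worth, the estimate is consistent with the planner simply
multiplying the two clauses' selectivities as if they were independent (my
reading of it, but the numbers do line up):

   selectivity(n = 1) = 10 / 20 = 0.5
   estimated rows     = 20 * 0.5 * 0.5 = 5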

 That issue is very annoying because with generated
 SQL queries (from Business Objects for example) on big tables, it is
 possible that some queries have several times the same where
 condition (where n=1 and n=1 for example), and as the optimizer is
 under-estimating the number of returned rows, some bad execution
 plans can be chosen (nested loops instead of hash joins for example)

Sounds like your query-generator needs a bit of an improvement, from my end.

 Is the estimated number of returned rows directly linked to the
 decision of the optimizer to chose Hash Joins or Nested Loops in join
 queries ? 

Yes, well the cost determines a plan and obviously number of rows
affects the cost.

 Is there a way for the Postgresql optimizer to be able to
 simplify and rewrite the SQL statements before running them ? 

It does, just not this one. It spots things like a=b and b=c implies a=c
(for joins etc).

 Are
 there some parameters that could change the execution plans ?

Not really in this case.

The root of your problem is that you have a query with an irrelevant
clause (AND n=1) and you'd like the planner to notice that it's
irrelevant and remove it from the query. There are two problems with this:

1. It's only going to be possible in simple cases. It's unlikely the
planner would ever spot n=2 AND n=(10/5)
2. Even in the simple case you're going to waste time checking *every
query* to see if clauses could be eliminated.

Is there any way to improve your query generator?

-- 
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] Optimizer : query rewrite and execution plan ?

2008-02-06 Thread Richard Huxton
SURANTYN Jean François wrote:
 Many thanks for your quick reply
 
 In fact, that issue comes from a recent migration from Oracle to
 PostgreSQL, and even if some queries were not optimal in the past
 (example: where n=1 and n=1), Oracle was able to rewrite them and
 hide the bad queries. But now that we have migrated to PostgreSQL,
 we have discovered that some queries were indeed badly written. I will
 tell the developers to optimize their queries so that they work
 efficiently on PostgreSQL.

If nothing else it will help when/if you decide to use prepared
queries - there's no way to optimise n=$1 or n=$2 at planning time.

-- 
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] Optimizer : query rewrite and execution plan ?

2008-02-06 Thread SURANTYN Jean François
Many thanks for your quick reply 

In fact, that issue comes from a recent migration from Oracle to PostgreSQL, 
and even if some queries were not optimal in the past (example: where n=1 and 
n=1), Oracle was able to rewrite them and hide the bad queries. But now that 
we have migrated to PostgreSQL, we have discovered that some queries were 
indeed badly written.
I will tell the developers to optimize their queries so that they work 
efficiently on PostgreSQL.

Thanks again for your help

Regards

Jean-Francois SURANTYN


-----Original Message-----
From: Richard Huxton [mailto:[EMAIL PROTECTED] 
Sent: Wednesday 6 February 2008 10:47
To: SURANTYN Jean François
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Optimizer : query rewrite and execution plan ?

SURANTYN Jean François wrote:
 my_db=# explain select * from test where n = 1;

 my_db=# explain select * from test where n = 1 and n = 1;

 In the first SELECT query (with where n=1), the estimated number of 
 returned rows is correct (10), whereas in the second SELECT query 
 (with where n=1 and n=1), the estimated number of returned rows is
 5 (instead of 10 !) So the optimizer has under-estimated the number of 
 rows returned

That's because it's a badly composed query. The planner is guessing how much 
overlap there would be between the two clauses. It's not exploring the option 
that they are the same clause repeated.

 That issue is very annoying because with generated SQL queries (from 
 Business Objects for example) on big tables, it is possible that some 
 queries have several times the same where
 condition (where n=1 and n=1 for example), and as the optimizer is 
 under-estimating the number of returned rows, some bad execution plans 
 can be chosen (nested loops instead of hash joins for example)

Sounds like your query-generator needs a bit of an improvement, from my end.

 Is the estimated number of returned rows directly linked to the 
 decision of the optimizer to chose Hash Joins or Nested Loops in join 
 queries ?

Yes, well the cost determines a plan and obviously number of rows affects the 
cost.

 Is there a way for the Postgresql optimizer to be able to simplify and 
 rewrite the SQL statements before running them ?

It does, just not this one. It spots things like a=b and b=c implies a=c (for 
joins etc).

 Are
 there some parameters that could change the execution plans ?

Not really in this case.

The root of your problem is that you have a query with an irrelevant clause 
(AND n=1) and you'd like the planner to notice that it's irrelevant and remove 
it from the query. There are two problems with this:

1. It's only going to be possible in simple cases. It's unlikely the planner 
would ever spot n=2 AND n=(10/5)
2. Even in the simple case you're going to waste time checking *every
query* to see if clauses could be eliminated.

Is there any way to improve your query generator?

--
  Richard Huxton
  Archonet Ltd





Re: [PERFORM] Benchmark Data requested

2008-02-06 Thread Dimitri Fontaine
On Wednesday 06 February 2008, Greg Smith wrote:
 pgloader is a great tool for a lot of things, particularly if there's any
 chance that some of your rows will get rejected.  But the way things pass
 through the Python/psycopg layer made it uncompetitive (more than 50%
 slowdown) against the straight COPY path from a rows/second perspective
 the last time (V2.1.0?) 

I've yet to add in the psycopg wrapper Marko wrote for skytools: at the moment 
I'm using the psycopg1 interface even when psycopg2 is used, and it seems the 
new version has some great performance improvements. I just didn't bother 
until now thinking this wouldn't affect COPY.

 I did what I thought was a fair test of it (usual 
 caveat of with the type of data I was loading).  Maybe there's been some
 gigantic improvement since then, but it's hard to beat COPY when you've
 got an API layer or two in the middle.

Did you compare to COPY or \copy? I'd expect the psycopg COPY API not to be 
that much more costly than the psql one, after all.
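
For reference, the kind of call I have in mind on the psycopg side, as a
minimal sketch (psycopg2; connection string, table and file names are made up):

   import psycopg2

   conn = psycopg2.connect("dbname=my_db")
   cur = conn.cursor()
   with open('/tmp/data.tsv') as f:
       # copy_from() streams the file through the COPY protocol,
       # so the per-row Python overhead should stay small
       cur.copy_from(f, 'test', sep='\t')
   conn.commit()
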
Where pgloader is really left behind (in terms of tuples inserted per second) 
compared to COPY is when it has to do a lot of juggling with the data, I'd say 
(reformat, reorder, add constants, etc). But I've tried to design it so that 
when it's not configured to arrange (massage?) the data, the code path is the 
simplest possible.

Do you want to test pgloader again with Marko's psycopg wrapper code to see if 
this helps? If yes, I'll arrange to push it to CVS ASAP.

Maybe at the end of this the PostgreSQL backend code will be smarter than 
pgloader (wrt error handling and data massaging) and we'll be able to drop the 
project, but in the meantime I'll try my best to have pgloader as fast as 
possible :)
-- 
dim




Re: [PERFORM] Optimizer : query rewrite and execution plan ?

2008-02-06 Thread Simon Riggs
On Wed, 2008-02-06 at 09:42 +0100, SURANTYN Jean François wrote:

 That issue is very annoying because with generated SQL queries (from
 Business Objects for example) on big tables, it is possible that some
 queries have several times the same where condition (where n=1 and
 n=1 for example), and as the optimizer is under-estimating the number
 of returned rows, some bad execution plans can be chosen (nested loops
 instead of hash joins for example)

I can see the annoyance there.

There's a balance in the planner between time spent to optimize the
query and time spent to correct mistakes. If we looked continually for
mistakes then planning time would increase for everybody that didn't
suffer from this problem.

Since the SQL is not your fault and difficult to control, it is an
argument in favour of an optional planner mode that would perform
additional checks for redundant clauses of various kinds. The default
for that would be off since most people don't suffer from this
problem. BO isn't the only SQL-generating client out there, so I think
this is a fairly wide problem.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com 




Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Simon Riggs
On Wed, 2008-02-06 at 12:27 +0100, Dimitri Fontaine wrote:
 Multi-Threading behavior and CE support
 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 
 Now, pgloader will be able to run N threads, each one loading some
 data to a 
 partitionned child-table target. N will certainly be configured
 depending on 
 the number of server cores and not depending on the partition
 numbers...
 
 So what do we do when reading a tuple we want to store in a partition
 which 
 has no dedicated Thread started yet, and we already have N Threads
 running?
 I'm thinking about some LRU(Thread) to choose a Thread to terminate
 (launch 
 COPY with current buffer and quit) and start a new one for the
 current 
 partition target.
 Hopefully there won't be such high values of N that the LRU is a bad
 choice 
 per see, and the input data won't be so messy to have to stop/start
 Threads 
 at each new line.

For me, it would be good to see a --parallel=n parameter that would
allow pg_loader to distribute rows in a round-robin manner to n
different concurrent COPY statements, i.e. a non-routing version. Making
that work well, whilst continuing to do error-handling, seems like a
challenge, but a very useful goal.

Adding intelligence to the row distribution may be technically hard but
may also simply move the bottleneck onto pg_loader. We may need multiple
threads in pg_loader, or we may just need multiple sessions from
pg_loader. Experience from doing the non-routing parallel version may
help in deciding whether to go for the routing version.
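
Something like this minimal sketch is what I have in mind (illustrative Python
only, not pgloader code; in real code each worker would stream its rows into a
single COPY ... FROM STDIN on its own connection):

   import threading
   from queue import Queue            # the "Queue" module on Python 2

   N = 4                              # --parallel=n

   def copy_worker(q):
       # in real code: open a connection, start COPY ... FROM STDIN,
       # and push each row received from the queue into that stream
       while True:
           row = q.get()
           if row is None:            # sentinel: flush the COPY buffer and stop
               break

   queues = [Queue(maxsize=10000) for _ in range(N)]
   workers = [threading.Thread(target=copy_worker, args=(q,)) for q in queues]
   for w in workers:
       w.start()

   with open('big_input_file.data') as f:       # made-up file name
       for i, line in enumerate(f):
           queues[i % N].put(line)              # pure round-robin, no routing

   for q in queues:
       q.put(None)
   for w in workers:
       w.join()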

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com 




Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Dimitri Fontaine
Hi,

I've been thinking about this topic some more, and as I don't know when I'll 
be able to go and implement it, I wanted to publish the ideas here. This way 
I'll be able to find them again :)

On Tuesday 05 February 2008, Dimitri Fontaine wrote:
 On Tuesday 05 February 2008, Simon Riggs wrote:
  Much better than triggers and rules, but it will be hard to get it to
  work.

 Well, I'm thinking about providing a somewhat modular approach where
 pgloader code is able to recognize CHECK constraints, load a module
 registered to the operator and data types, then use it.

Here's how I think I'm gonna implement it:

User level configuration
-=-=-=-=-=-=-=-=-=-

At user level, you will have to add a constraint_exclusion = on parameter to 
the pgloader section configuration for it to bother checking whether the 
destination table has some children etc.
You'll also need to provide a global ce_path parameter (where to find user 
python constraint exclusion modules) and a ce_modules parameter for each 
section where constraint_exclusion = on:
  ce_modules = columnA:module:class, columnB:module:class

As ce_path could point to any number of modules, and a single type could be 
supported by several modules, I'll let the user choose which module to use.

Constraint exclusion modules
-=-=-=-=-=-=-=-=-=-=-=-=-

The modules will provide one or several classes (kind of a packaging issue); 
each one will have to register which datatypes and operators it knows about. 
Here's some pseudo-code of a module, which certainly is the best way to 
express a code design idea:

class MyCE:
    def __init__(self, operator, constant, cside='r'):
        """ CHECK ( col operator constant ) => cside = 'r', could be 'l' """
        self.operator = operator
        self.constant = constant
        self.cside    = cside
        ...

    @classmethod
    def support_type(cls, type):
        return type in ['integer', 'bigint', 'smallint', 'real', 'double']

    @classmethod
    def support_operator(cls, op):
        return op in ['=', '<', '>', '<=', '>=', '%']

    def check(self, op, data):
        if op == '>':
            return self.gt(data)
        ...

    def gt(self, data):
        # cside tells on which side of the operator the constant sits
        if self.cside == 'l':
            return self.constant > data
        elif self.cside == 'r':
            return data > self.constant

This way pgloader will be able to support any datatype (user datatype like 
IP4R included) and operator (@@, ~= or whatever). For pgloader to handle a 
CHECK() constraint, though, it'll have to be configured to use a CE class 
supporting the used operators and datatypes.
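
To make the intended call pattern concrete, here is a stripped-down,
self-contained sketch (class and names are illustrative only, not the actual
module interface):

   # one CE object per CHECK clause, asked whether a value satisfies it
   class RangeCheck:
       def __init__(self, op, constant):
           self.op, self.constant = op, constant

       def check(self, value):
           if self.op == '>=': return value >= self.constant
           if self.op == '<':  return value < self.constant
           raise ValueError('unsupported operator: %s' % self.op)

   # partition declared as CHECK ( n >= 100 AND n < 200 )
   clauses = [RangeCheck('>=', 100), RangeCheck('<', 200)]
   print(all(c.check(150) for c in clauses))   # True  -> this partition
   print(all(c.check(250) for c in clauses))   # False -> try another one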

PGLoader constraint exclusion support
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

The CHECK() constraint being a tree of check expressions[*] linked by logical 
operators, pgloader will have to build some logic tree of MyCE (user CE 
modules) and evaluate all the checks in order to be able to choose the input 
line partition.

[*]: check((a % 10) = 1) makes an expression tree containing 2 check nodes

After having parsed pg_constraint.consrc (not conbin, which seems too much of 
an internal dump for use from user code) and built a CHECK tree for each 
partition, pgloader will try to decide if it's about range partitioning (the 
most common case). 

If each partition CHECK tree is AND(a >= b, a < c) or a variation of it, we 
have range partitioning. Then surely we can optimize the code that chooses the 
partition to COPY the data to and still use the module operator 
implementation, e.g. by making a binary search on a partition limits tree.
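
A sketch of that fast path, assuming contiguous, sorted, non-overlapping
[low, high) ranges (table names and bounds are made up):

   import bisect

   # one entry per child table, sorted by lower bound
   partitions   = [(0, 'log_0'), (100, 'log_100'), (200, 'log_200')]
   lower_bounds = [low for low, _ in partitions]

   def target_partition(value):
       i = bisect.bisect_right(lower_bounds, value) - 1
       if i < 0:
           return None                  # below the first range: reject the line
       return partitions[i][1]

   print(target_partition(42))          # log_0
   print(target_partition(150))         # log_100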

If you want some other widely used (or not) partitioning scheme to be 
recognized and optimized by pgloader, just tell me and we'll see about it :)
Having this step as a user module seems overkill at the moment, though.

Multi-Threading behavior and CE support
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Now, pgloader will be able to run N threads, each one loading some data to a 
partitioned child-table target. N will certainly be configured depending on 
the number of server cores and not depending on the number of partitions...

So what do we do when reading a tuple we want to store in a partition which 
has no dedicated Thread started yet, and we already have N Threads running?
I'm thinking about some LRU(Thread) policy to choose a Thread to terminate 
(launch COPY with the current buffer and quit) and start a new one for the 
current partition target.
Hopefully there won't be such high values of N that the LRU is a bad choice 
per se, and the input data won't be so messy that we have to stop/start 
Threads at each new line.
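
A small sketch of that LRU(Thread) policy (Python 3 for brevity; the real
worker start/stop and COPY flushing are of course elided):

   from collections import OrderedDict

   MAX_THREADS = 4
   active = OrderedDict()                   # partition -> worker, oldest first

   class Worker:                            # stand-in for a real COPY thread
       def __init__(self, partition):
           self.partition = partition
       def flush_and_stop(self):
           print('flushing COPY buffer for', self.partition)

   def worker_for(partition):
       if partition in active:
           active.move_to_end(partition)    # used again: most recently used
           return active[partition]
       if len(active) >= MAX_THREADS:
           _, victim = active.popitem(last=False)   # least recently used
           victim.flush_and_stop()          # COPY current buffer and quit
       active[partition] = Worker(partition)
       return active[partition]

   for part in ['p1', 'p2', 'p3', 'p4', 'p5', 'p2']:
       worker_for(part)                     # 'p1' is evicted when 'p5' shows up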

Comments welcome, regards,
-- 
dim




Re: [PERFORM] Optimizer : query rewrite and execution plan ?

2008-02-06 Thread Theo Kramer
On Wed, 2008-02-06 at 11:53 +, Simon Riggs wrote:
 On Wed, 2008-02-06 at 09:42 +0100, SURANTYN Jean François wrote:
 
  That issue is very annoying because with generated SQL queries (from
  Business Objects for example) on big tables, it is possible that some
  queries have several times the same where condition (where n=1 and
  n=1 for example), and as the optimizer is under-estimating the number
  of returned rows, some bad execution plans can be chosen (nested loops
  instead of hash joins for example)
 
 I can see the annoyance there.
 
 There's a balance in the planner between time spent to optimize the
 query and time spent to correct mistakes. If we looked continually for
 mistakes then planning time would increase for everybody that didn't
 suffer from this problem.
 
 Since the SQL is not your fault and difficult to control, it is an
 argument in favour of an optional planner mode that would perform
 additional checks for redundant clauses of various kinds. The default
 for that would be off since most people don't suffer from this
 problem. BO isn't the only SQL generating-client out there, so I think
 this is a fairly wide problem.

I would have to disagree. I spend a lot of time writing code that
generates SQL from a business app and feel strongly that any
optimisation is my responsibility.

Having to re-configure PG to switch on a planner mode, as suggested
above, to address badly generated SQL is not a good idea.

This comes from experience of having to talk business application developers 
through re-configuring a database.

-- 
Regards
Theo




Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Dimitri Fontaine
On Wednesday 06 February 2008, Simon Riggs wrote:
 For me, it would be good to see a --parallel=n parameter that would
 allow pg_loader to distribute rows in round-robin manner to n
 different concurrent COPY statements. i.e. a non-routing version.

What happens when you want at most N parallel Threads and have several sections 
configured: do you want pgloader to serialize section loading (often there's 
one section per table, sometimes different sections target the same table) 
but parallelise each section's loading?

I'm thinking we should have a global max_threads knob *and* a per-section 
one if we want to go this way, but then multi-threaded sections 
will somewhat fight against other sections (multi-threaded or not) for 
threads to use.

So I'll also add a parameter to configure how many (max) sections to load in 
parallel at any time.

We'll then have (default values presented):
max_threads = 1
max_parallel_sections = 1
section_threads = -1

The section_threads parameter would be overridable at the section level but 
would need to stay <= max_threads (if not, it is discarded and a warning 
issued). When section_threads is -1, pgloader tries to use as many threads as 
possible, still within the max_threads global limit.
If max_parallel_sections is -1, pgloader starts a new thread for each new 
section, maxing out at max_threads, then waits for a thread to finish before 
launching a new section load.
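
To take a concrete example of the intended semantics (numbers made up):

  max_threads = 8
  max_parallel_sections = 2
  section_threads = -1

would mean at most two sections are loading at any one time, the two of them 
sharing the global 8-thread budget between their per-section threads, and a 
third section only starting once one of the first two has finished.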

If you have N max_threads and max_parallel_sections = section_threads = -1, 
then we'll see some kind of a fight between new-section threads and in-section 
threads (the parallel non-routing COPY behaviour). But then it's a user 
choice.

Adding the Constraint_Exclusion support into this would not mess it up, but 
it'll only be interesting when section_threads != 1 and max_threads > 1.

 Making 
 that work well, whilst continuing to do error-handling seems like a
 challenge, but a very useful goal.

Quick tests showed me the Python threading model allows for easy sharing of 
objects between several threads, so I don't think I'll need to adjust my 
reject code when going per-section multi-threaded. I'll just have to use a 
semaphore object so rejected lines keep being written one at a time. Not that 
complex if reliable.
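
Roughly this (a sketch only; pgloader's real reject objects look different):

   import threading

   reject_lock = threading.Lock()          # a lock, i.e. a binary semaphore

   def reject(line, reason, logfile):
       with reject_lock:                   # one rejected line at a time
           logfile.write('%s\t%s\n' % (reason, line.rstrip('\n')))
           logfile.flush()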

 Adding intelligence to the row distribution may be technically hard but
 may also simply move the bottleneck onto pg_loader. We may need multiple
 threads in pg_loader, or we may just need multiple sessions from
 pg_loader. Experience from doing the non-routing parallel version may
 help in deciding whether to go for the routing version.

If non-routing per-section multi-threading is a user request and not that 
hard to implement (thanks to python), that sounds like a good enough reason 
for me to provide it :)

I'll keep you (and the list) informed as soon as I have the code to play 
with.
-- 
dim




Re: [PERFORM] Optimizer : query rewrite and execution plan ?

2008-02-06 Thread Roberts, Jon

  Since the SQL is not your fault and difficult to control, it is an
  argument in favour of an optional planner mode that would perform
  additional checks for redundant clauses of various kinds. The default
  for that would be off since most people don't suffer from this
  problem. BO isn't the only SQL generating-client out there, so I think
  this is a fairly wide problem.
 
 I would have to disagree. I spend a lot of time writing code that
 generates SQL from a business app and feel strongly that any
 optimisation is my responsibility.
 

The point of a BI tool like BO is to abstract the data collection and do
it dynamically.  The SQL is built at run time because the tool is
designed to give the end user as much flexibility as the data structure
allows to query the data however they want.

It isn't feasible, possible, or recommended to rewrite all of the
possible generated SQL that could be designed at runtime by the tool.  



Jon



Re: [PERFORM] Optimizer : query rewrite and execution plan ?

2008-02-06 Thread Erik Jones


On Feb 6, 2008, at 7:35 AM, Roberts, Jon wrote:




 Since the SQL is not your fault and difficult to control, it is an
 argument in favour of an optional planner mode that would perform
 additional checks for redundant clauses of various kinds. The default
 for that would be off since most people don't suffer from this
 problem. BO isn't the only SQL generating-client out there, so I think
 this is a fairly wide problem.

 I would have to disagree. I spend a lot of time writing code that
 generates SQL from a business app and feel strongly that any
 optimisation is my responsibility.

 The point to a BI tool like BO is to abstract the data collection and do
 it dynamically.  The SQL is built at run time because the tool is
 designed to give the end user as much flexibility as the data structure
 allows to query the data however they want.

 It isn't feasible, possible, or recommended to rewrite all of the
 possible generated SQL that could be designed at runtime by the tool.


No, but it is feasible to expect the tool to generate well-formed queries 
without redundant clauses.  There are plenty that do.


Erik Jones

DBA | Emma®
[EMAIL PROTECTED]
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com






Re: [PERFORM] Benchmark Data requested

2008-02-06 Thread Greg Smith

On Wed, 6 Feb 2008, Dimitri Fontaine wrote:


Did you compare to COPY or \copy?


COPY.  If you're loading a TB, if you're smart it's going onto the server 
itself if at all possible and loading directly from there.  Would probably 
get a closer comparison against psql \copy, but recognize you're always 
going to be compared against the best server-side copy method available.



Do you want to test pgloader again with Marko psycopgwrapper code to see if
this helps?


Wouldn't have time myself for at least a month (probably not until after 
the East convention) so don't go making commits on my behalf.



Maybe at the end of this PostgreSQL backend code will be smarter than pgloader
(wrt error handling and data massaging) and we'll be able to drop the
project


There are way too many data massaging cases that I never expect the backend 
to handle, which pgloader does a great job of right now, and I think there 
will always be a niche for a tool like this.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Optimizer : query rewrite and execution plan ?

2008-02-06 Thread Roberts, Jon
 
  Since the SQL is not your fault and difficult to control, it is an
  argument in favour of an optional planner mode that would perform
  additional checks for redundant clauses of various kinds. The default
  for that would be off since most people don't suffer from this
  problem. BO isn't the only SQL generating-client out there, so I think
  this is a fairly wide problem.

  I would have to disagree. I spend a lot of time writing code that
  generates SQL from a business app and feel strongly that any
  optimisation is my responsibility.


  The point to a BI tool like BO is to abstract the data collection and do
  it dynamically.  The SQL is built at run time because the tool is
  designed to give the end user as much flexibility as the data structure
  allows to query the data however they want.

  It isn't feasible, possible, or recommended to rewrite all of the
  possible generated SQL that could be designed at runtime by the tool.
 
 No, but it is feasible to expect the tool to generate well-formed
 queries without redundant clauses.  There are plenty that do.
 


Agreed.


Jon



Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Greg Smith

On Wed, 6 Feb 2008, Simon Riggs wrote:


For me, it would be good to see a --parallel=n parameter that would
allow pg_loader to distribute rows in round-robin manner to n
different concurrent COPY statements. i.e. a non-routing version.


Let me expand on this.  In many of these giant COPY situations the 
bottleneck is plain old sequential I/O to a single process.  You can 
almost predict how fast the rows will load using dd.  Having a process 
that pulls rows in and distributes them round-robin is good, but it won't 
crack that bottleneck.  The useful approaches I've seen for other 
databases all presume that the data files involved are large enough that 
on big hardware, you can start multiple processes running at different 
points in the file and beat anything possible with a single reader.


If I'm loading a TB file, odds are good I can split that into 4 or more 
vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders 
at once, and get way more than 1 disk worth of throughput reading.  You 
have to play with the exact number because if you push the split too far 
you introduce seek slowdown instead of improvements, but that's the basic 
design I'd like to see one day.  It's not parallel loading that's useful 
for the cases I'm thinking about until something like this comes around.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Optimizer : query rewrite and execution plan ?

2008-02-06 Thread Tom Lane
Theo Kramer [EMAIL PROTECTED] writes:
 On Wed, 2008-02-06 at 11:53 +, Simon Riggs wrote:
 Since the SQL is not your fault and difficult to control, it is an
 argument in favour of an optional planner mode that would perform
 additional checks for redundant clauses of various kinds. The default
 for that would be off since most people don't suffer from this
 problem. BO isn't the only SQL generating-client out there, so I think
 this is a fairly wide problem.

 I would have to disagree. I spend a lot of time writing code that
 generates SQL from a business app and feel strongly that any
 optimisation is my responsibility.

Disagree with what?  If that's your feeling then you'd leave the setting
off, and no harm done.

We used to have code that removed duplicate WHERE clauses (check the
revision history for prepqual.c).  It was taken out because it consumed
excessive amounts of planning time without accomplishing a darn thing
for most queries.  There is no chance that it will be put back in as the
only behavior, or even the default behavior, but I can see the reasoning
for offering an option as Simon suggests.

regards, tom lane



Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Luke Lonergan
Hi Greg,

On 2/6/08 7:56 AM, Greg Smith [EMAIL PROTECTED] wrote:

 If I'm loading a TB file, odds are good I can split that into 4 or more
 vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders
 at once, and get way more than 1 disk worth of throughput reading.  You
 have to play with the exact number because if you push the split too far
 you introduce seek slowdown instead of improvements, but that's the basic
 design I'd like to see one day.  It's not parallel loading that's useful
 for the cases I'm thinking about until something like this comes around.

Just load 4 relfiles.  You have to be able to handle partial relfiles, which
changes the storage mgmt a bit, but the benefits are easier to achieve.

- Luke




Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Jignesh K. Shah



Greg Smith wrote:

On Wed, 6 Feb 2008, Simon Riggs wrote:


For me, it would be good to see a --parallel=n parameter that would
allow pg_loader to distribute rows in round-robin manner to n
different concurrent COPY statements. i.e. a non-routing version.


Let me expand on this.  In many of these giant COPY situations the 
bottleneck is plain old sequential I/O to a single process.  You can 
almost predict how fast the rows will load using dd.  Having a process 
that pulls rows in and distributes them round-robin is good, but it 
won't crack that bottleneck.  The useful approaches I've seen for 
other databases all presume that the data files involved are large 
enough that on big hardware, you can start multiple processes running 
at different points in the file and beat anything possible with a 
single reader.


If I'm loading a TB file, odds are good I can split that into 4 or 
more vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 
4 loaders at once, and get way more than 1 disk worth of throughput 
reading.  You have to play with the exact number because if you push 
the split too far you introduce seek slowdown instead of improvements, 
but that's the basic design I'd like to see one day.  It's not 
parallel loading that's useful for the cases I'm thinking about until 
something like this comes around.




Some food for thought here: most BI-type applications which do data 
conversions/cleansing also might end up sorting the data before it's loaded 
into a database, so starting parallel loaders at totally different points 
ruins that effort. A more pragmatic approach would be to interleave the reads 
from the input file: if there are N parallel streams then each one is offset 
by 1 from the others and jumps by N rows, so the seeks are pretty much 
narrowed down to a few rows (ideally 1) instead of jumping 1/Nth of the rows 
every time a read happens.
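
In loader terms the interleaved scheme would look something like this (Python
sketch, assuming plain line-oriented input; a real loader would feed the lines
to COPY instead of just yielding them):

   from itertools import islice

   def read_stream(path, k, n):
       # reader k of n takes lines k, k+n, k+2n, ...
       # each reader still scans every line, but since the n readers stay
       # within a few lines of each other the pages are already cached
       with open(path) as f:
           for line in islice(f, k, None, n):
               yield line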


For example, to replicate this with dd to see the impact, use a big file and 
play with the seek option and blocksizes. Somebody out here once did that 
test and showed that the seek time on the file being read is reduced 
significantly and, depending on the file system, intelligent prefetching 
(which unfortunately UFS in Solaris does not do well by default) means all 
the reads for the next stream will already be in memory.




Regards,
Jignesh




Re: [PERFORM] Benchmark Data requested

2008-02-06 Thread Dimitri Fontaine
On Wednesday 06 February 2008, Greg Smith wrote:
 COPY.  If you're loading a TB, if you're smart it's going onto the server
 itself if it all possible and loading directly from there.  Would probably
 get a closer comparision against psql \copy, but recognize you're always
 going to be compared against the best server-side copy method available.

Fair enough on your side, even if I can't expect an external tool using the 
network protocol to compete with the backend reading a local file. I wanted 
to make sure the 50% slowdown was not only due to my code being that bad.

 There are way too many data massaging cases I never expect the backend
 will handle that pgloader does a great job of right now, and I think there
 will always be a niche for a tool like this.

Let's try to continue improving the tool then!
-- 
dim




Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Dimitri Fontaine
On Wednesday 06 February 2008, Greg Smith wrote:
 If I'm loading a TB file, odds are good I can split that into 4 or more
 vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders
 at once, and get way more than 1 disk worth of throughput reading.

pgloader already supports starting at any input file line number, and 
limiting itself to any number of reads:

  -C COUNT, --count=COUNT
number of input lines to process
  -F FROMCOUNT, --from=FROMCOUNT
number of input lines to skip

So you could already launch 4 pgloader processes with the same configuration 
file but different command line arguments. If there's interest/demand, it's 
easy enough for me to add those parameters as file configuration knobs too.
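
For instance (line counts made up for a 40-million-line file), something like:

  pgloader --from=0        --count=10000000 ...
  pgloader --from=10000000 --count=10000000 ...
  pgloader --from=20000000 --count=10000000 ...
  pgloader --from=30000000 --count=10000000 ...

each run against the same configuration file (the trailing dots stand for the 
usual configuration arguments).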

Still you have to pay for client to server communication instead of having the 
backend read the file locally, but now maybe we begin to compete?

Regards,
-- 
dim




Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Luke Lonergan
Improvements are welcome, but to compete in the industry, loading will need to 
speed up by a factor of 100.

Note that Bizgres loader already does many of these ideas and it sounds like 
pgloader does too.

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Dimitri Fontaine [mailto:[EMAIL PROTECTED]
Sent:   Wednesday, February 06, 2008 12:41 PM Eastern Standard Time
To: pgsql-performance@postgresql.org
Cc: Greg Smith
Subject: Re: [PERFORM] Benchmark Data requested --- pgloader CE design 
ideas

On Wednesday 06 February 2008, Greg Smith wrote:
 If I'm loading a TB file, odds are good I can split that into 4 or more
 vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders
 at once, and get way more than 1 disk worth of throughput reading.

pgloader already supports starting at any input file line number, and limit 
itself to any number of reads:

  -C COUNT, --count=COUNT
number of input lines to process
  -F FROMCOUNT, --from=FROMCOUNT
number of input lines to skip

So you could already launch 4 pgloader processes with the same configuration 
fine but different command lines arguments. It there's interest/demand, it's 
easy enough for me to add those parameters as file configuration knobs too.

Still you have to pay for client to server communication instead of having the 
backend read the file locally, but now maybe we begin to compete?

Regards,
-- 
dim


Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Dimitri Fontaine
On Wednesday 06 February 2008 18:37:41, Dimitri Fontaine wrote:
 On Wednesday 06 February 2008, Greg Smith wrote:
  If I'm loading a TB file, odds are good I can split that into 4 or more
  vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4
  loaders at once, and get way more than 1 disk worth of throughput
  reading.

 pgloader already supports starting at any input file line number, and limit
 itself to any number of reads:

In fact, the -F option works by having pgloader read the given number of lines 
but skip processing them, which is not at all what Greg is talking about here 
I think.

Plus, I think it would be easier for me to code some stat() then lseek() then 
read() into the pgloader readers machinery than to change the code 
architecture to support a separate thread for the file reader.

Greg, what would you think of a pgloader which would separate file reading 
based on file size as given by stat (os.stat(file)[ST_SIZE]) and the number 
of threads: we split into as many pieces as the section_threads section 
config value.

This behaviour won't be available for sections where type = text and 
field_count(*) is given, because in this case I don't see how pgloader could 
reliably recognize the beginning of a new logical line and start processing 
there.
In other cases, a logical line is a physical line, so we start after the 
first newline met from the given lseek start position, and continue reading 
past the last lseek position until a newline.

*:http://pgloader.projects.postgresql.org/#_text_format_configuration_parameters
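
A standalone sketch of that splitting logic (Python, just to show the idea; 
this is not pgloader code and ignores the logical-line case above):

   import os

   def split_points(path, nsplits):
       size = os.stat(path).st_size
       return [size * i // nsplits for i in range(nsplits)] + [size]

   def read_chunk(path, start, end):
       # yields the lines whose first byte lies in [start, end)
       with open(path, 'rb') as f:
           if start > 0:
               f.seek(start - 1)
               f.readline()          # skip to the next line start
           while f.tell() < end:
               line = f.readline()
               if not line:
                   break
               yield line            # a line straddling `end` is read in full

   # e.g. 4 readers over the same file, one per thread
   points = split_points('big_input_file.data', 4)
   ranges = list(zip(points, points[1:]))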

Comments?
-- 
dim




Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Dimitri Fontaine
On Wednesday 06 February 2008 18:49:56, Luke Lonergan wrote:
 Improvements are welcome, but to compete in the industry, loading will need
 to speed up by a factor of 100.

Oh, I meant competing with the internal COPY command instead of the \copy 
one, not with the competition. AIUI, competing with the competition will need 
some PostgreSQL internal improvements, which I'll let the -hackers do :)

 Note that Bizgres loader already does many of these ideas and it sounds
 like pgloader does too.

We're talking about how to improve pgloader :)

-- 
dim




Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Greg Smith

On Wed, 6 Feb 2008, Dimitri Fontaine wrote:


In fact, the -F option works by having pgloader read the given number of lines
but skip processing them, which is not at all what Greg is talking about here
I think.


Yeah, that's not useful.


Greg, what would you think of a pgloader which will separate file reading
based on file size as given by stat (os.stat(file)[ST_SIZE]) and number of
threads: we split into as many pieces as section_threads section config
value.


Now you're talking.  Find a couple of split points that way, fine-tune the 
boundaries a bit so they rest on line termination points, and off you go. 
Don't forget that the basic principle here implies you'll never know until 
you're done just how many lines were really in the file.  When thread#1 is 
running against chunk#1, it will never have any idea what line chunk#2 
really started at until it reaches there, at which point it's done and 
that information isn't helpful anymore.


You have to stop thinking in terms of lines for this splitting; all you 
can do is split the file into useful byte sections and then count the 
lines within them as you go.  Anything else requires a counting scan of 
the file and such a sequential read is exactly what can't happen 
(especially not more than once), it just takes too long.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
