Re: [PERFORM] Internal operations when the planner makes a hash join.

2010-02-24 Thread Igor Neyman
 

 -Original Message-
 From: negora [mailto:neg...@negora.com] 
 Sent: Tuesday, February 23, 2010 4:33 PM
 To: Scott Carey
 Cc: Alvaro Herrera; pgsql-performance@postgresql.org
 Subject: Re: Internal operations when the planner makes a hash join.
 
 Thank you for explaining me the internal behaviour of the 
 PostgreSQL engine. I'll try to look for more information 
 about that hash tables. It sounds really really interesting. 
 Your information was very useful.
 
 The origin of my doubt resides in the fact that I need to do 
 a joint between 3 HUGE tables (millions of registries) and do 
 certain operations with the retrieved information. I was 
 deciding whether to use one SELECT with 3 JOINs, as I've been 
 doing since the beginning, or build a PL/PgSQL function based 
 on 3 nested FOR ... IN SELECT ... LOOP 
 structures which tried to minimize the subsequent table 
 searches storing intermediate useful data in arrays 
 (curiously, these would act as the hash tables which you 
 mention, but in a very very rudimentary way). In a case like 
 this one (possibly unable to fit in RAM), Is also JOIN the 
 best solution?
 
 Since I've to retrieve such a big amount of columns and 
 crossed registries I had started to think that using 1 SELECT 
 with 3 JOINs would increase the number of table searches a 
 LOT and also duplicate the information too much. I mean 
 duplicate as in this case, where the Factor 1 appears 
 millions of times for every Element:
 
 Element 1 | Sub-factor 1 | Factor 1
 Element 2 | Subf-actor 1 | Factor 1
 ...
 Element 12639747465586 | Sub-factor 1 | Factor 1 Element 1 | 
 Sub-factor 2 | Factor 1
 
 I hope not to robber you much time but... What do you think 
 about it? Is it better either 1 SELECT with 3 JOINs or build 
 nested FOR ... IN SELECT ... LOOP structures? Could it be 
 one of that cases in which I've to choose between either 
 higher speed but higher memory consume (3
 JOINs) or lower speed but less memory expense (3 FORs)?
 
 Thanks again and apologizes for extending this topic too much.
 
 
 Scott Carey wrote:
  On Feb 23, 2010, at 8:53 AM, Alvaro Herrera wrote:
 

  negora wrote:
 
  
  According to how I understood the process, the engine 
 would get the 
  name from the student with ID 1 and would look for the 
 name of the 
  father with ID 1 in the hashed table. It'd do exactly the 
 same with 
  the student #2 and father #2. But my big doubt is about 
 the 3rd one 
  (Anthony). Would the engine know that it already had 
 retrieved the 
  father's name for the student 1 and would avoid searching for it 
  into the hashed table (using some kind of internal 
 mechanism which 
  allows to re-utilize the name)? Or would it search into 
 the hashed 
  table again?br

  The hash table is searched again.  But that's fast, because it's a 
  hash table.
 
  
 
  To answer the question another way, remembering that it 
 has already seen father A once and tracking that would use a 
 hash table to remember that fact.  
 
  The hash table created by the first scan IS the remember 
 you have seen this father data structure, optimized for fast 
 lookup.  So before even looking at the first student, the 
 hash table is built so that it is fast to find out if a 
 father has been seen before, and if so where that father's 
 data is located.  Looking this data up is often referred to 
 as a probe and not a scan because it takes just as long 
 to do if the hash table has 100 entries or 1 entries.  
 The drawback is that the whole thing has to fit in RAM.
 
 

  -- 
  Alvaro Herrera
 http://www.CommandPrompt.com/
  The PostgreSQL Company - Command Prompt, Inc.
 
  --
  Sent via pgsql-performance mailing list 
  (pgsql-performance@postgresql.org)
  To make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-performance
  
 
 

 

So, you are trying to do nested loop in PL/PgSQL.
Why not let optimizer decide between nested loop and hash join based
on your memory settings and statistics collected for objects involved?
I'm pretty sure, it'll be faster than PL/PgSQL 3 nested loops.

Igor Neyman

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Internal operations when the planner makes a hash join.

2010-02-23 Thread negora

Hello:

I'm an ignorant in what refers to performance analysis of PostgreSQL. 
I've a doubt about how the PostgreSQL planner makes a hash join. I've 
tried to dig into the archive of this mailing list but I haven't found 
what I'm looking for. So I'm explaining my doubt with an example to see 
if anyone can help me.


Let's suppose that I've 2 tables, one of students and the other one of 
parents in a many-to-one relation. I want to do something like this:


   SELECT s.complete_name, f.complete_name
   FROM students AS s
   JOIN fathers AS f ON f.id_father = s.id_father;

Using the ANALYZE command, I've checked that the planner firstly scans 
and extracts the required information from fathers, builds a temporary 
hash table from it, then scans students, and finally joins the 
information from this table and the temporary one employing the relation 
f.id_father = s.id_father.


My doubt is about this last step. When the planner checks the temporary 
table looking for the parent of a student:


A) Does it run through the temporary father's table one time per 
student? This means that if there are 500 students, it's doing 500 loops 
on the temporary table.
B) Or does it try to internally group students with the same father ID 
to avoid doing absurd loops on the temporary one?


That's all. Thank you very much for your kindness :) .



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Internal operations when the planner makes a hash join.

2010-02-23 Thread Kevin Grittner
negora neg...@negora.com wrote:
 
 I've a doubt about how the PostgreSQL planner makes a hash join.
 
 Let's suppose that I've 2 tables, one of students and the other
 one of parents in a many-to-one relation. I want to do something
 like this:
 
 SELECT s.complete_name, f.complete_name
 FROM students AS s
 JOIN fathers AS f ON f.id_father = s.id_father;
 
 Using the ANALYZE command, I've checked that the planner firstly
 scans and extracts the required information from fathers, builds
 a temporary hash table from it, then scans students, and finally
 joins the information from this table and the temporary one
 employing the relation f.id_father = s.id_father.
 
This sort of plan is sometimes used when the optimizer expects the
hash table to fit into RAM, based on statistics and your work_mem
setting.  If it does fit, that's one sequential scan of the father
table's heap, and a hashed lookup into RAM to find the father to
match each student.  For the sort of query you're showing, that's
typically a very good plan.
 
-Kevin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Internal operations when the planner makes a hash join.

2010-02-23 Thread negora




First of all, thank you for your fast answer,
Kevin :) .

However I still wonder if on the search into the hashed
table (stored in the RAM, as you're pointing out), it checks for
fathers as
many times as students are selected, or if the engine uses some kind of
intelligent heuristic to avoid searching for the same father more than
once.

For example:

students

id_student | name | id_father

1 | James | 1
2 | Laura | 2
3 | Anthony | 1


fathers (hashed table into RAM)

id_father | name

1 | John
2 | Michael


According to how I understood the process, the engine would get the
name from the student with ID 1 and would look for the name of the
father with ID 1 in the hashed table. It'd do exactly the same with the
student #2 and father #2. But my big doubt is about the 3rd one
(Anthony). Would the engine "know" that it already had retrieved the
father's name for the student 1 and would avoid searching for it into
the hashed table (using some kind of internal mechanism which allows to
"re-utilize" the name)? Or would it search into the hashed table again?

Thanks a lot for your patience :) .


Kevin Grittner wrote:

  negora neg...@negora.com wrote:
 
  
  
I've a doubt about how the PostgreSQL planner makes a hash join.

  
   
  
  
Let's suppose that I've 2 tables, one of students and the other
one of parents in a many-to-one relation. I want to do something
like this:

SELECT s.complete_name, f.complete_name
FROM students AS s
JOIN fathers AS f ON f.id_father = s.id_father;

Using the ANALYZE command, I've checked that the planner firstly
scans and extracts the required information from "fathers", builds
a temporary hash table from it, then scans "students", and finally
joins the information from this table and the temporary one
employing the relation "f.id_father = s.id_father".

  
   
This sort of plan is sometimes used when the optimizer expects the
hash table to fit into RAM, based on statistics and your work_mem
setting.  If it does fit, that's one sequential scan of the father
table's heap, and a hashed lookup into RAM to find the father to
match each student.  For the sort of query you're showing, that's
typically a very good plan.
 
-Kevin

  





Re: [PERFORM] Internal operations when the planner makes a hash join.

2010-02-23 Thread Alvaro Herrera
negora wrote:

 According to how I understood the process, the engine would get the
 name from the student with ID 1 and would look for the name of the
 father with ID 1 in the hashed table. It'd do exactly the same with the
 student #2 and father #2. But my big doubt is about the 3rd one
 (Anthony). Would the engine know that it already had retrieved the
 father's name for the student 1 and would avoid searching for it into
 the hashed table (using some kind of internal mechanism which allows to
 re-utilize the name)? Or would it search into the hashed table again?br

The hash table is searched again.  But that's fast, because it's a hash
table.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Internal operations when the planner makes a hash join.

2010-02-23 Thread Scott Carey

On Feb 23, 2010, at 8:53 AM, Alvaro Herrera wrote:

 negora wrote:
 
 According to how I understood the process, the engine would get the
 name from the student with ID 1 and would look for the name of the
 father with ID 1 in the hashed table. It'd do exactly the same with the
 student #2 and father #2. But my big doubt is about the 3rd one
 (Anthony). Would the engine know that it already had retrieved the
 father's name for the student 1 and would avoid searching for it into
 the hashed table (using some kind of internal mechanism which allows to
 re-utilize the name)? Or would it search into the hashed table again?br
 
 The hash table is searched again.  But that's fast, because it's a hash
 table.
 

To answer the question another way, remembering that it has already seen 
father A once and tracking that would use a hash table to remember that fact.  

The hash table created by the first scan IS the remember you have seen this 
father data structure, optimized for fast lookup.  So before even looking at 
the first student, the hash table is built so that it is fast to find out if a 
father has been seen before, and if so where that father's data is located.  
Looking this data up is often referred to as a probe and not a scan because 
it takes just as long to do if the hash table has 100 entries or 1 entries. 
 The drawback is that the whole thing has to fit in RAM.


 -- 
 Alvaro Herrerahttp://www.CommandPrompt.com/
 The PostgreSQL Company - Command Prompt, Inc.
 
 -- 
 Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-performance


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Internal operations when the planner makes a hash join.

2010-02-23 Thread negora
Thank you for explaining me the internal behaviour of the PostgreSQL 
engine. I'll try to look for more information about that hash tables. It 
sounds really really interesting. Your information was very useful.


The origin of my doubt resides in the fact that I need to do a joint 
between 3 HUGE tables (millions of registries) and do certain operations 
with the retrieved information. I was deciding whether to use one SELECT 
with 3 JOINs, as I've been doing since the beginning, or build a 
PL/PgSQL function based on 3 nested FOR ... IN SELECT ... LOOP 
structures which tried to minimize the subsequent table searches storing 
intermediate useful data in arrays (curiously, these would act as the 
hash tables which you mention, but in a very very rudimentary way). In a 
case like this one (possibly unable to fit in RAM), Is also JOIN the 
best solution?


Since I've to retrieve such a big amount of columns and crossed 
registries I had started to think that using 1 SELECT with 3 JOINs would 
increase the number of table searches a LOT and also duplicate the 
information too much. I mean duplicate as in this case, where the 
Factor 1 appears millions of times for every Element:


Element 1 | Sub-factor 1 | Factor 1
Element 2 | Subf-actor 1 | Factor 1
...
Element 12639747465586 | Sub-factor 1 | Factor 1
Element 1 | Sub-factor 2 | Factor 1

I hope not to robber you much time but... What do you think about it? Is 
it better either 1 SELECT with 3 JOINs or build nested FOR ... IN 
SELECT ... LOOP structures? Could it be one of that cases in which I've 
to choose between either higher speed but higher memory consume (3 
JOINs) or lower speed but less memory expense (3 FORs)?


Thanks again and apologizes for extending this topic too much.


Scott Carey wrote:

On Feb 23, 2010, at 8:53 AM, Alvaro Herrera wrote:

  

negora wrote:



According to how I understood the process, the engine would get the
name from the student with ID 1 and would look for the name of the
father with ID 1 in the hashed table. It'd do exactly the same with the
student #2 and father #2. But my big doubt is about the 3rd one
(Anthony). Would the engine know that it already had retrieved the
father's name for the student 1 and would avoid searching for it into
the hashed table (using some kind of internal mechanism which allows to
re-utilize the name)? Or would it search into the hashed table again?br
  

The hash table is searched again.  But that's fast, because it's a hash
table.




To answer the question another way, remembering that it has already seen father A once and tracking that would use a hash table to remember that fact.  


The hash table created by the first scan IS the remember you have seen this father data 
structure, optimized for fast lookup.  So before even looking at the first student, the hash table is built 
so that it is fast to find out if a father has been seen before, and if so where that father's data is 
located.  Looking this data up is often referred to as a probe and not a scan because 
it takes just as long to do if the hash table has 100 entries or 1 entries.  The drawback is that the 
whole thing has to fit in RAM.


  

--
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance




  


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Internal operations when the planner makes a hash join.

2010-02-23 Thread Kevin Grittner
negora neg...@negora.com wrote:
 
 The origin of my doubt resides in the fact that I need to do a
 joint between 3 HUGE tables (millions of registries) and do
 certain operations with the retrieved information. I was deciding
 whether to use one SELECT with 3 JOINs, as I've been doing since
 the beginning, or build a PL/PgSQL function based on 3 nested FOR
 ... IN SELECT ... LOOP structures which tried to minimize the
 subsequent table searches storing intermediate useful data in
 arrays
 
It's almost always faster (and less error prone) to write one SELECT
statement declaring what you want than to try to do better by
navigating individual rows procedurally.  I would *strongly*
recommend you write it with the JOINs and then post here if you have
any concerns about the performance.  In general, try to *declare*
what you want, and let the PostgreSQL planner sort out the best way
to navigate the tables to produce what you want.  If you hit some
particular weakness in the planner, you many need to coerce it, but
certainly you should not *start* with that.
 
-Kevin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Internal operations when the planner makes a hash join.

2010-02-23 Thread Kevin Grittner
negora neg...@negora.com wrote: 
 
 I even might return the entire result to my external Java
 application
 
You are probably going to want to configure it to use a cursor, at
least if the result set is large (i.e., too big to cache the entire
result set in memory before you read the first row).  Read this over
carefully:
 
http://jdbc.postgresql.org/documentation/84/query.html#query-with-cursor
 
You don't have to use a Java cursor or do anything procedurally
there, but a PostgreSQL cursor is the only way to stream the data to
the application on demand (ResultSet.next), rather than pushing it
all there during the Statement.execute().
 
-Kevin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance