Re: [PERFORM] Looking for ideas on how to speed up warehouse loading

2004-04-26 Thread CoL
hi,
Sean Shanny wrote, On 4/22/2004 23:56:
SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER JOIN d_referral t1 
ON t2.url = t1.referral_raw_url ORDER BY t1.id
An index on url (a text column) makes little sense here. Try adding an md5
column (char(32)) that contains the md5 hash of the url field, and join on
that instead. You can build a much smaller, more efficient index on that
char(32) field.
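
A sketch of that approach (the column and index names are my own
illustration, and md5() assumes the built-in function available in
PostgreSQL 7.4):

```sql
-- Store a fixed-width md5 key next to each url and index it:
ALTER TABLE d_referral ADD COLUMN url_md5 char(32);
UPDATE d_referral SET url_md5 = md5(referral_raw_url);
CREATE INDEX d_referral_url_md5_idx ON d_referral (url_md5);

ALTER TABLE referral_temp ADD COLUMN url_md5 char(32);
UPDATE referral_temp SET url_md5 = md5(url);

-- Same left join as before, but on the compact char(32) key:
SELECT t1.id, t2.url
FROM referral_temp t2
LEFT OUTER JOIN d_referral t1 ON t2.url_md5 = t1.url_md5
ORDER BY t1.id;
```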

Do not forget to ANALYZE the tables after the data load, and you can
fine-tune your postgresql.conf (default_statistics_target for better index
statistics, among others).
Check these info pages:
http://www.varlena.com/varlena/GeneralBits/Tidbits/annotated_conf_e.html
http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html

C.
---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Looking for ideas on how to speed up warehouse loading

2004-04-23 Thread Aaron Werman
By definition, it is equivalent to:

SELECT t1.id, t2.url FROM referral_temp t2 JOIN d_referral t1
ON t2.url = t1.referral_raw_url
union all
SELECT null, url FROM referral_temp t2
WHERE NOT EXISTS
(SELECT 1 FROM d_referral t1 WHERE t1.referral_raw_url = t2.url)
ORDER BY 1;



/Aaron

- Original Message - 
From: Joe Conway [EMAIL PROTECTED]
To: Sean Shanny [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Friday, April 23, 2004 12:38 AM
Subject: Re: [PERFORM] Looking for ideas on how to speed up warehouse
loading


 Sean Shanny wrote:
  explain analyze SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER
  JOIN d_referral t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id;

  What I would like to know is if there are better ways to do the join?  I
  need to get all the rows back from the referral_temp table as they are
  used for assigning FK's for the fact table later in processing.  When I
  iterate over the values that I get back those with t1.id = null I assign
  a new FK and push both into the d_referral table as new entries as well
  as a text file for later use.  The matching records are written to a
  text file for later use.

 Would something like this work any better (without disabling index scans):

 SELECT t1.id, t2.url
 FROM referral_temp t2, d_referral t1
 WHERE t1.referral_raw_url = t2.url;

 process rows with a match

 SELECT t1.id, t2.url
 FROM referral_temp t2
 WHERE NOT EXISTS
 (select 1 FROM d_referral t1 WHERE t1.referral_raw_url = t2.url);

 process rows without a match

 ?

 Joe



Re: [PERFORM] Looking for ideas on how to speed up warehouse loading

2004-04-23 Thread scott.marlowe

On Thu, 22 Apr 2004, Sean Shanny wrote:

 I should have included this as well:
  fsync  | on
  shared_buffers | 4000
  sort_mem   | 64000

For purposes of loading only, you can try turning off fsync, assuming this 
is a virgin load and you can just re-initdb should bad things happen (OS, 
postgresql crash, power plug pulled, etc...)

Also increasing sort_mem and shared_buffers might help.  Especially 
sort_mem.  But turn it back down to something reasonable after the import.

And turn fsync back on after the import too.  Note you have to restart 
postgresql to make fsync = off take effect.
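
The recipe above, as a postgresql.conf sketch for the duration of the load
only (values are illustrative, not recommendations from this thread):

```
# postgresql.conf, bulk-load window only; restore and restart afterwards
fsync = false            # requires a postmaster restart; safe only for a
                         # re-creatable load (re-initdb on any crash)
sort_mem = 256000        # KB per sort; turn back down after the import
shared_buffers = 16000   # raise from 4000 if RAM allows
```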




Re: [PERFORM] Looking for ideas on how to speed up warehouse loading

2004-04-22 Thread Sean Shanny
I should have included this as well:

show all;
 name  |setting
+
add_missing_from   | on
australian_timezones   | off
authentication_timeout | 60
check_function_bodies  | on
checkpoint_segments| 64
checkpoint_timeout | 30
checkpoint_warning | 30
client_encoding| UNICODE
client_min_messages| notice
commit_delay   | 0
commit_siblings| 5
cpu_index_tuple_cost   | 0.001
cpu_operator_cost  | 0.0025
cpu_tuple_cost | 0.01
DateStyle  | ISO, MDY
db_user_namespace  | off
deadlock_timeout   | 1000
debug_pretty_print | off
debug_print_parse  | off
debug_print_plan   | off
debug_print_rewritten  | off
default_statistics_target  | 1000
default_transaction_isolation  | read committed
default_transaction_read_only  | off
dynamic_library_path   | $libdir
effective_cache_size   | 40
enable_hashagg | on
enable_hashjoin| on
enable_indexscan   | on
enable_mergejoin   | on
enable_nestloop| on
enable_seqscan | on
enable_sort| on
enable_tidscan | on
explain_pretty_print   | on
extra_float_digits | 0
from_collapse_limit| 8
fsync  | on
geqo   | on
geqo_effort| 1
geqo_generations   | 0
geqo_pool_size | 0
geqo_selection_bias| 2
geqo_threshold | 11
join_collapse_limit| 8
krb_server_keyfile | unset
lc_collate | C
lc_ctype   | C
lc_messages| C
lc_monetary| C
lc_numeric | C
lc_time| C
log_connections| off
log_duration   | off
log_error_verbosity| default
log_executor_stats | off
log_hostname   | off
log_min_duration_statement | -1
log_min_error_statement| panic
log_min_messages   | notice
log_parser_stats   | off
log_pid| off
log_planner_stats  | off
log_source_port| off
log_statement  | off
log_statement_stats| off
log_timestamp  | on
max_connections| 100
max_expr_depth | 1
max_files_per_process  | 1000
max_fsm_pages  | 2
max_fsm_relations  | 1000
max_locks_per_transaction  | 64
password_encryption| on
port   | 5432
pre_auth_delay | 0
preload_libraries  | unset
random_page_cost   | 4
regex_flavor   | advanced
rendezvous_name| unset
search_path| $user,public
server_encoding| UNICODE
server_version | 7.4.1
shared_buffers | 4000
silent_mode| off
sort_mem   | 64000
sql_inheritance| on
ssl| off
statement_timeout  | 0
stats_block_level  | on
stats_command_string   | on
stats_reset_on_server_start| off
stats_row_level| on
stats_start_collector  | on
superuser_reserved_connections | 2
syslog | 0
syslog_facility| LOCAL0
syslog_ident   | postgres
tcpip_socket   | on
TimeZone   | unknown
trace_notify   | off
transaction_isolation  | read committed
transaction_read_only  | off
transform_null_equals  | off
unix_socket_directory  | unset
unix_socket_group  | unset
unix_socket_permissions| 511
vacuum_mem | 64000
virtual_host   | unset
wal_buffers| 1024
wal_debug  | 0
wal_sync_method| open_sync
zero_damaged_pages | off
Sean Shanny wrote:

To all,

Essentials:  Running 7.4.1 on OSX on a loaded G5 with dual procs, 8GB 
memory, direct attached via fibre channel to a fully optioned 3.5TB 
XRaid (14 spindles, 2 sets of 7 in RAID 5) box running RAID 50.

Background:  We are loading what are essentially xml based access logs 
from about 20+ webservers daily, about 6GB of raw data.  We have a 
classic star schema.  All the ETL tools are custom java code or 
standard *nix tools like sort, uniq etc...

The problem:  We have about 46 million rows in a table with the 
following schema:

Table public.d_referral
  Column   |  Type   | 

Re: [PERFORM] Looking for ideas on how to speed up warehouse loading

2004-04-22 Thread Nicholas Shanny
One other thing: we are running with a block size of 32K.

Nick Shanny
(Brother of above person)
On Apr 22, 2004, at 7:30 PM, Sean Shanny wrote:

[settings output and message snipped; identical to the post quoted above]

Re: [PERFORM] Looking for ideas on how to speed up warehouse loading

2004-04-22 Thread Tom Lane
Sean Shanny [EMAIL PROTECTED] writes:
 explain analyze SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER 
 JOIN d_referral t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id;
  
 QUERY PLAN
 --
 Sort  (cost=4012064.81..4013194.45 rows=451856 width=115) (actual
time=1297320.823..1297739.813 rows=476176 loops=1)
   Sort Key: t1.id
   ->  Hash Left Join  (cost=1052345.95..3969623.10 rows=451856
width=115) (actual time=1146650.487..1290230.590 rows=476176 loops=1)
         Hash Cond: ("outer".url = "inner".referral_raw_url)
         ->  Seq Scan on referral_temp t2  (cost=0.00..6645.56
rows=451856 width=111) (actual time=20.285..1449.634 rows=476176 loops=1)
         ->  Hash  (cost=729338.16..729338.16 rows=46034716 width=124)
(actual time=1146440.710..1146440.710 rows=0 loops=1)
               ->  Seq Scan on d_referral t1  (cost=0.00..729338.16
rows=46034716 width=124) (actual time=14.502..-1064277.123 rows=46034715
loops=1)
 Total runtime: 1298153.193 ms
(8 rows)

 What I would like to know is if there are better ways to do the join?

What have you got sort_mem set to?  You might try increasing it to a gig
or so, since you seem to have plenty of RAM in that box ...
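
For a quick experiment along those lines, one might raise it just for the
loading session (the value is illustrative; sort_mem is in kilobytes in 7.4,
so this is roughly 1 GB):

```sql
SET sort_mem = 1000000;  -- session-local, ~1 GB; assumes enough free RAM
-- ... run the big join here ...
RESET sort_mem;          -- back to the postgresql.conf value
```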

regards, tom lane



Re: [PERFORM] Looking for ideas on how to speed up warehouse loading

2004-04-22 Thread Joe Conway
Sean Shanny wrote:
explain analyze SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER 
JOIN d_referral t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id;

What I would like to know is if there are better ways to do the join?  I 
need to get all the rows back from the referral_temp table as they are 
used for assigning FK's for the fact table later in processing.  When I 
iterate over the values that I get back those with t1.id = null I assign 
a new FK and push both into the d_referral table as new entries as well 
as a text file for later use.  The matching records are written to a 
text file for later use.
Would something like this work any better (without disabling index scans):

SELECT t1.id, t2.url
FROM referral_temp t2, d_referral t1
WHERE t1.referral_raw_url = t2.url;
process rows with a match

SELECT t1.id, t2.url
FROM referral_temp t2
WHERE NOT EXISTS
(select 1 FROM d_referral t1 WHERE t1.referral_raw_url = t2.url);
process rows without a match

?
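
A third way to pull the unmatched rows, equivalent to the NOT EXISTS form
above (a sketch using the same tables; whether it plans better on 7.4 would
need testing), is the classic left-join anti-join:

```sql
-- Rows in referral_temp with no match in d_referral:
SELECT t2.url
FROM referral_temp t2
LEFT OUTER JOIN d_referral t1 ON t1.referral_raw_url = t2.url
WHERE t1.id IS NULL;
```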

Joe
