Re: [PERFORM] Bulk loading/merging
On 06/02/2006, Michael Artz wrote: Hmm, I don't quite understand what you are saying, and I think my basic misunderstanding is how to use UPDATE ... WHERE EXISTS to merge data in bulk. Assuming that I bulk COPYed the data into a temporary table, I'd need to issue an UPDATE for each row in the newly created table, right? For example, for a slightly different key,count schema:

CREATE TABLE kc (key integer, count integer);

and wanting to merge the following data by just updating the count for a given key to the equivalent of OLD.count + NEW.count:

1,10
2,15
3,45
1,30

How would I go about using UPDATE ... WHERE EXISTS to update the master kc table from a (temporary) table loaded with the above data?

Maybe this method could help you:

CREATE TEMP TABLE clip_temp (
    cids int8 NOT NULL,
    clip_id int8 NOT NULL,
    mentions int4 DEFAULT 0,
    CONSTRAINT pk_clip_temp PRIMARY KEY (cids, clip_id)
);

Insert the data into this temporary table, then do:

UPDATE clip_category SET mentions = clip_temp.mentions
FROM clip_temp
WHERE clip_category.cids = clip_temp.cids
  AND clip_category.clip_id = clip_temp.clip_id;

DELETE FROM clip_temp USING clip_category
WHERE clip_temp.cids = clip_category.cids
  AND clip_temp.clip_id = clip_category.clip_id;

INSERT INTO clip_category (cids, clip_id, mentions)
SELECT * FROM clip_temp;

DROP TABLE clip_temp;

Best regards,
ahmad fajar
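For the kc example in the question, the same update-then-insert pattern might look like the sketch below. This is an illustration, not from the thread: `kc_new` is a hypothetical staging table loaded via COPY, and the duplicate rows for key 1 are summed first so each key is applied once.

```sql
-- Hypothetical staging table holding the COPYed rows
CREATE TEMP TABLE kc_new (key integer, count integer);

-- Collapse duplicate keys in the staging data (e.g. the two rows for key 1),
-- then add the new counts to existing keys
UPDATE kc SET count = kc.count + s.count
FROM (SELECT key, sum(count) AS count FROM kc_new GROUP BY key) s
WHERE kc.key = s.key;

-- Insert keys that did not exist in the master table yet
INSERT INTO kc (key, count)
SELECT s.key, s.count
FROM (SELECT key, sum(count) AS count FROM kc_new GROUP BY key) s
WHERE NOT EXISTS (SELECT 1 FROM kc WHERE kc.key = s.key);
```

Running the UPDATE before the INSERT matters: the anti-join in the INSERT must not see rows that were just inserted, and summing in the staging subquery keeps the whole merge down to two statements instead of one UPDATE per row.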
Re: [PERFORM] serious problems with vacuuming databases
Hi Tomas, Tomas wrote: We've decided to remove unneeded 'old' data, which means removing about 99.999% of rows from tables A, C and D (about 2 GB of data). At the beginning, the B table (containing data aggregated from A, C and D) was emptied (dropped and created) and filled in with current data. Then, before the deletion, the data from tables A, C and D were backed up into other tables (say A_old, C_old, D_old) filled in using …

1) drop, create and fill table B (aggregated data from A, C, D)
2) copy 'old' data from A, C and D to A_old, C_old and D_old
3) delete old data from A, C, D
4) dump data from A_old, C_old and D_old
5) truncate tables A, C, D
6) vacuum full analyze tables A, C, D, A_old, C_old and D_old

I think you are doing some difficult database maintenance. Why do that, if you just want to keep a small piece of the data from your tables? Why don't you try something like:

1. create table A with no index (don't fill data into this table),
2. create table A_week_year inheriting table A, with the indexes you want and some condition for insertion (e.g. table A1 is used for one week of data of a year, and so on),
3. do the same for tables B, C and D,
4. if you have relations, make the relations on the inheriting tables (optional).

I think you should read the PostgreSQL manual for more information about table inheritance. The impact is that you may end up with many tables, but each table will hold only a small piece of the data, for example just one week. And you don't have to do difficult database maintenance like you have done: you just need to create tables for every week of data, do vacuum/analyze, and take regular backups.

Best regards,
ahmad fajar

---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
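A minimal sketch of the inheritance scheme described above. Table and column names here are made up for illustration; on PostgreSQL of this era there is no declarative partitioning, so the per-week condition is enforced with a CHECK constraint and the application inserts into the correct child table directly.

```sql
-- Parent table: no indexes, never filled directly
CREATE TABLE a (
    id      bigint NOT NULL,
    ts      timestamp NOT NULL,
    payload text
);

-- One child table per week, inheriting from the parent
CREATE TABLE a_week01_2006 (
    CHECK (ts >= '2006-01-02' AND ts < '2006-01-09')
) INHERITS (a);

CREATE INDEX a_week01_2006_ts_idx ON a_week01_2006 (ts);

-- Queries against the parent automatically see all children
SELECT count(*) FROM a WHERE ts >= '2006-01-02' AND ts < '2006-01-09';

-- Removing a week of old data is just a DROP TABLE,
-- instead of a huge DELETE followed by VACUUM FULL
DROP TABLE a_week01_2006;
```

The key operational win is the last statement: dropping a child table is instantaneous and leaves no dead tuples behind, which is exactly the maintenance cost the original six-step procedure was paying.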
Re: [PERFORM] Multiple Order By Criteria
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
Sent: Wednesday, 18 January 2006 07:23
To: Stephan Szabo
Cc: Josh Berkus; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Multiple Order By Criteria

I've read all of this info, closely. I wish these pages had come up when I was searching for an answer to my problem. Oh well.

Well, I think you have to know about the btree index. Btree is good enough, although it's not the best. It performs best when it doesn't index too many columns, so in your case you should concentrate on the 2 or 3 fields you will use most frequently. Put the column with the most duplicated values at the front and the others behind it. E.g.: receipt, carrier_id and batchnum are used most frequently, but the most duplicated values are in carrier_id, then receipt, then batchnum, so make the btree index (carrier_id, receipt, batchnum). The btree will not suffer, and we also gain an advantage if the table has a relationship with another table on the same fields in the same order: we don't have to make another index for that relation.

Best regards,
ahmad fajar

I am getting an idea of what I need to do to make this work well. I was wondering if there is more information to read on how to implement this solution in a simpler way. Much of what's written seems to be aimed at an audience that is expected to understand certain things automatically.

- Original Message -
From: Stephan Szabo [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: Josh Berkus josh@agliodbs.com; pgsql-performance@postgresql.org
Sent: Tuesday, January 17, 2006 6:39 PM
Subject: Re: [PERFORM] Multiple Order By Criteria

On Tue, 17 Jan 2006 [EMAIL PROTECTED] wrote: I created the index like this:

CREATE INDEX rcbee_idx ON detail_summary USING btree (receipt, carrier_id, batchnum, encounternum, encounter_id);

Is this correct? — That would work if you were asking for all the columns ascending or descending, but we don't currently use it for mixed orders. — How do I make a reverse opclass?
There's some information at the following: http://archives.postgresql.org/pgsql-novice/2005-10/msg00254.php http://archives.postgresql.org/pgsql-general/2005-01/msg00121.php http://archives.postgresql.org/pgsql-general/2004-06/msg00565.php
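For readers on newer versions: the reverse-opclass workaround in those links predates per-column sort orders. Since PostgreSQL 8.3, a mixed-order index can be declared directly; a sketch against the detail_summary table from this thread (column choice here is illustrative):

```sql
-- Mixed ascending/descending orders in one index (PostgreSQL 8.3+);
-- lets ORDER BY receipt ASC, carrier_id DESC, batchnum ASC use the index
-- without any custom reverse opclass
CREATE INDEX rcb_mixed_idx ON detail_summary
    (receipt ASC, carrier_id DESC, batchnum ASC);
```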
Re: [PERFORM] Text/Varchar performance...
Dear Cristian,

If you need to index the field, you must know that a btree index limits the length of the indexed value (values longer than roughly a third of a page — about 2,700 bytes with the default 8 kB page size — cannot be indexed). So if you need to index the field you should limit the field type, e.g. varchar(250); then you can index the field and gain better performance when searching on it, because the search can use the index you have created. If you do not need to index the field, you can use a text field, which can store values up to about 1 GB.

Regards,
ahmad fajar

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Cristian Prieto
Sent: Thursday, 6 October 2005 1:22
To: pgsql-general@postgresql.org; pgsql-performance@postgresql.org
Subject: [PERFORM] Text/Varchar performance...

Hello, just a little question: is it preferable to use text fields or varchar(255) fields in a table? Are there any performance differences in the use of either of them? Thanks a lot for your answer!
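If the column really must hold long values, one workaround is to index only a prefix of the text with an expression index. This is a sketch; the `articles`/`body` names are hypothetical, and queries must repeat the same expression to use the index:

```sql
-- Index only the first 250 characters of a long text column
CREATE INDEX articles_body_prefix_idx ON articles (substr(body, 1, 250));

-- The query must use the identical expression for the index to apply
SELECT id FROM articles
WHERE substr(body, 1, 250) = 'some search key';
```

This keeps every indexed value under the btree size limit while leaving the stored column unbounded.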
Re: [PERFORM] Query seem to slow if table have more than 200 million rows
Hi Qingqing,

I don't know whether the statistics I got are bad or good; here they are:

select a.relid, a.relname, b.indexrelid, b.indexrelname,
       c.idx_scan, c.idx_tup_read, c.idx_tup_fetch,
       a.heap_blks_read, a.heap_blks_hit, a.idx_blks_read, a.idx_blks_hit,
       a.toast_blks_read, a.toast_blks_hit, a.tidx_blks_read, a.tidx_blks_hit,
       b.idx_blks_read, b.idx_blks_hit
from pg_statio_user_tables a, pg_statio_user_indexes b, pg_stat_all_indexes c
where a.relid = b.relid and a.relid = c.relid
  and b.indexrelid = c.indexrelid
  and a.relname = b.relname and a.relname = c.relname
  and a.relname = 'fti_dict1';

  relid   |  relname  | indexrelid | indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
----------+-----------+------------+--------------+----------+--------------+---------------
 22880226 | fti_dict1 |   22880231 | idx_dict3    |        0 |            0 |             0
 22880226 | fti_dict1 |   22880230 | idx_dict2    |        7 |       592799 |        592799
 22880226 | fti_dict1 |   22880229 | idx_dict1    |        0 |            0 |             0
(3 rows)

All the heap_blks_*, idx_blks_*, toast_blks_* and tidx_blks_* counters were 0 or empty for all three rows.

I have tried the query below several times with different keywords, but only idx_tup_read and idx_tup_fetch changed; the others stayed at zero. The indexes are: ids (idx_dict1), keywords (idx_dict2, varchar_ops), keywords (idx_dict3, varchar_pattern_ops) == I use this index for queries like ...
keywords like 'blabla%', just for testing purposes.

Regards,
ahmad fajar

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Qingqing Zhou
Sent: Tuesday, 27 September 2005 8:43
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Query seem to slow if table have more than 200 million rows

Ahmad Fajar [EMAIL PROTECTED] wrote: Select ids, keywords from dict where keywords='blabla' ('blabla' is a single word); The table has 200 million rows, and I have indexed the keywords field. The first time, my query seems slow: about 15-60 sec to get the result. But if I repeat the query I get a fast result. My question is why the query is so slow the first time. Table structure is quite simple: ids bigint, keywords varchar(150), weight varchar(1), dpos int.

The first slowness is obviously caused by disk I/O. The second time is faster because all the data pages it requires are already in the buffer pool. 200 million rows is not a problem for a btree index; even if your client tool appends some spaces to your keywords at insertion time, the ideal btree is 5 to 6 levels high at most. Can you show the I/O stats of the index from your statistics view?

http://www.postgresql.org/docs/8.0/static/monitoring-stats.html#MONITORING-STATS-VIEWS

Regards,
Qingqing
Re: [PERFORM] tsearch2 seem very slow
Hi Oleg,

Sorry for my late reply. From the stat() function I got 1.5 million rows; although I've added garbage words to the stop-word file, there still seem to be garbage words left. So I've asked my team to identify the garbage words and add them to the stop words, and I will update the articles after that. About my articles: they are quite big, the average length is about 2,900 characters. I think I have to tune tsearch2 and concentrate on the garbage words. Most articles are in the Indonesian language. What other ways are there to tune tsearch2 besides the garbage words?

Besides that, I still have a problem. If I do a simple query like: Select ids, keywords from dict where keywords='blabla' ('blabla' is a single word); — the table has 200 million rows and I have indexed the keywords field — then the first time, the query is slow, about 15-60 sec to get the result. I use the latest pgAdmin3 to test all queries. But if I repeat the query I get a fast result. My question is why the query is so slow the first time. I tried to cluster the table on the keyword index, but after 15 hours of waiting it hadn't finished, so I stopped the clustering.

Now I think I have to change the file system for the PostgreSQL data. Do you have any idea which is best for PostgreSQL, JFS or XFS? I will not try reiserfs, because there are some rumors about reiserfs stability, although reiserfs is fast enough for PostgreSQL. And must I downgrade my PostgreSQL from version 8.0.3 to 7.4.8?

Regards,
ahmad fajar

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Oleg Bartunov
Sent: Saturday, September 24, 2005 1:08 PM
To: Ahmad Fajar
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] tsearch2 seem very slow

Ahmad, what about the number of unique words? I mean the stat() function. Sometimes it helps to identify garbage words. How big are your articles (average length)? Please cut'n'paste queries and output from psql! How fast are the next queries?
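For reference, the stat() function Oleg mentions is shipped with the tsearch2 contrib module. A typical call to list the most frequent lexemes — the usual way to spot garbage words for the stop list — might look like this, assuming a tsvector column named fti:

```sql
-- Top 20 most frequent words across all documents in the fti column;
-- high-frequency noise words here are candidates for the stop-word file
SELECT word, ndoc, nentry
FROM stat('SELECT fti FROM tbarticles')
ORDER BY ndoc DESC, nentry DESC
LIMIT 20;
```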
Oleg

On Fri, 23 Sep 2005, Ahmad Fajar wrote: [...]
[PERFORM] Query seem to slow if table have more than 200 million rows
If I do a simple query like: Select ids, keywords from dict where keywords='blabla' ('blabla' is a single word); — the table has 200 million rows and I have indexed the keywords field — then the first time, my query is slow: about 15-60 sec to get the result. But if I repeat the query I get a fast result. My question is why the query is so slow the first time.

Table structure is quite simple: ids bigint, keywords varchar(150), weight varchar(1), dpos int. I use the latest pgAdmin3 to test all queries.

My Linux box is RedHat 4 AS, kernel 2.6.9-11, PostgreSQL version 8.0.3; 2x200 GB SATA 7200 RPM disks configured as RAID0 with an ext3 file system for the PostgreSQL data only, and an 80 GB EIDE 7200 RPM disk with ext3 for the OS only. The server has 2 GB RAM and a P4 3.2 GHz.

If I run this query on MS SQL Server, with the same hardware specification and the same data, MS SQL Server beats PostgreSQL: the query takes about 0-4 sec. What is wrong with my PostgreSQL?

wassalam,
ahmad fajar
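One way to confirm that the first-run slowness is disk I/O rather than planning is to time the identical statement twice in the same session; a minimal sketch:

```sql
-- First run: index and heap pages come from disk (slow)
EXPLAIN ANALYZE SELECT ids, keywords FROM dict WHERE keywords = 'blabla';

-- Second run: the same pages come from shared buffers / OS cache (fast);
-- if only the "actual time" changes between runs, the plan is fine
-- and the difference is pure I/O
EXPLAIN ANALYZE SELECT ids, keywords FROM dict WHERE keywords = 'blabla';
```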
Re: [PERFORM] tsearch2 seem very slow
Hi Oleg,

Thanks, I will read your documentation again and try to understand what I missed. And the manual pages help me a lot; I'll pay attention to them.

Regards,
ahmad fajar

-Original Message-
From: Oleg Bartunov [mailto:[EMAIL PROTECTED]
Sent: Monday, September 26, 2005 3:12 AM
To: Ahmad Fajar
Cc: pgsql-performance@postgresql.org
Subject: RE: [PERFORM] tsearch2 seem very slow

Ahmad, On Mon, 26 Sep 2005, Ahmad Fajar wrote: Hi Oleg,

What kind of garbage? Probably you index token types you don't need, for example email addresses or file names. Do you need proximity? If not, use the strip(tsvector) function to remove the coordinate information from the tsvector.

I need proximity. Sometimes I have to rank my articles and make a chart from that.

Don't index the default configuration; index only the tokens you need. For example, to index only 3 types of tokens, first create a 'qq' configuration and specify the tokens to index:

insert into pg_ts_cfg values('qq','default','en_US');
-- tokens to index
insert into pg_ts_cfgmap values('qq','lhword','{en_ispell,en_stem}');
insert into pg_ts_cfgmap values('qq','lword','{en_ispell,en_stem}');
insert into pg_ts_cfgmap values('qq','lpart_hword','{en_ispell,en_stem}');

I still don't understand the tsearch2 configuration, so until now I have just used the default configuration. I will try your suggestion. But how can I get en_ispell? Will my system know what to do if I use ,'{en_ispell,en_stem}'? In the default configuration I only see: ..., '{en_stem}'.

I think you should read the documentation. I can't explain things that are already written there.

Besides that, I still have a problem. If I do a simple query like: Select ids, keywords from dict where keywords='blabla' ('blabla' is a single word); — the table has 200 million rows and I have indexed the keywords field — the first time, the query takes about 15-60 sec to get the result. I use the latest pgAdmin3 to test all queries. But if I repeat the query I get a fast result.
My question is why the query is slow the first time.

Because the index pages have to be read from disk into shared buffers, the next query benefits from them being cached. You need enough shared memory to get a real benefit. You can look at the PostgreSQL stats and check the cache hit ratio. By the way, how does your query (keywords='blabla') relate to tsearch2?

(keywords='blabla') isn't related to tsearch2; I just got an idea from tsearch2 and tried a different approach. But I'm stuck on the speed of the query result: very slow on the first query. And how do I see the PostgreSQL stats and the cache hit ratio? I still don't know how to get them.

Learn from http://www.postgresql.org/docs/8.0/static/monitoring-stats.html

I tried to cluster the table on the keyword index, but after 15 hours of waiting it hadn't finished, so I stopped the clustering.

Don't use CLUSTER for big tables! A simple select * into clustered_foo from foo order by indexed_field would be faster and does the same job.

What is the use of the clustered_foo table? And how do I use it? I think it will not remove duplicate rows. And the clustered_foo table still doesn't have an index, so a query against it will be very slow.

Oh guy, you certainly need to read the documentation: http://www.postgresql.org/docs/8.0/static/sql-cluster.html

Regards,
ahmad fajar

Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
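The cache-hit-ratio check Oleg refers to can be done against the statistics views from the monitoring-stats page; a sketch using pg_statio_user_tables:

```sql
-- Fraction of heap block requests served from shared buffers, per table;
-- values close to 1.0 mean the working set fits in cache,
-- low values explain slow first-run queries
SELECT relname,
       heap_blks_hit::float
           / NULLIF(heap_blks_hit + heap_blks_read, 0) AS hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;
```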
Re: [PERFORM] tsearch2 seem very slow
Hi Oleg,

For the single-column index I tried this query:

explain analyze select articleid, title, datee from articles where fti @@ to_tsquery('bank&indonesia');

Result:
Index Scan using fti_idx on articles (cost=0.00..862.97 rows=420 width=51) (actual time=0.067..183761.324 rows=46186 loops=1)
  Index Cond: (fti @@ '\'bank\' & \'indonesia\''::tsquery)
Total runtime: 183837.826 ms

And for the multicolumn index I tried this query:

explain analyze select articleid, title, datee from articles where fti @@ to_tsquery('bank&mega');

Result:
Index Scan using articles_x1 on articles (cost=0.00..848.01 rows=410 width=51) (actual time=52.204..37914.135 rows=1841 loops=1)
  Index Cond: ((datee >= '2002-01-01'::date) AND (datee <= ('now'::text)::date) AND (fti @@ '\'bank\' & \'mega\''::tsquery))
Total runtime: 37933.757 ms

The table structure is as mentioned in my first message. If you want to know how many tables are in my database: about 100 or maybe more. Now I'm developing version 2 of my web application (you can take a look at http://www.mediatrac.net), so it will hold a lot of data, but the biggest table is the articles table. For version 2 I use only half of the articles table's data (about 419,804 rows); if I import all of the data it will have about 1 million rows. The articles table grows rapidly, about 10 rows per week. My development database size is 28 GB (not the real database; I'm still developing version 2 and use half of the data to play around with). I just want to perform a quick full-text search on my articles table, not other tables. Version 1, the currently running version, uses the same hardware specification as mentioned below, but has no full-text search. So I'm developing the new version with new features, a new interface, and full-text search. I do know that when the application is finished I must use powerful hardware.
But how can I guarantee the application will run smoothly if a full-text search over 419,804 rows in a table takes such a long time to get a result? Could you or friends on this mailing list help me... please.. please..

Tsearch2 configuration:
I use the default configuration, the English stop-word file as tsearch2 provides it, the stem dictionary as default (because I don't know how to configure it or add new data to the stem dictionary), and I have added some words to the English stop-word file.

PostgreSQL configuration:
max_connections = 32
shared_buffers = 32768
sort_mem = 8192
vacuum_mem = 65536
work_mem = 16384
maintenance_work_mem = 65536
max_fsm_pages = 3
max_fsm_relations = 1000
max_files_per_process = 10
checkpoint_segments = 15
effective_cache_size = 192000
random_page_cost = 2
geqo = true
geqo_threshold = 50
geqo_effort = 5
geqo_pool_size = 0
geqo_generations = 0
geqo_selection_bias = 2.0
from_collapse_limit = 10
join_collapse_limit = 15

OS configuration:
I use RedHat 4 AS, kernel 2.6.9-11
kernel.shmmax=1073741824
kernel.sem=250 32000 100 128
fs.aio-max-nr=5242880
The server is configured only for PostgreSQL; no other service is running (www, samba, ftp, email, firewall).

Hardware configuration:
Motherboard ASUS P5GD1, Processor P4 3.2 GHz, Memory 2 GB DDR 400; 2x200 GB Serial ATA 7200 RPM UltraATA/133 configured as RAID0 for the PostgreSQL data (ext3 partition); 1x80 GB EIDE 7200 RPM for the system and home directory (ext3 partition).

Did I miss something?

Regards,
ahmad fajar

-Original Message-
From: Oleg Bartunov [mailto:[EMAIL PROTECTED]
Sent: Friday, 23 September 2005 18:26
To: Ahmad Fajar
Cc: pgsql-performance@postgresql.org
Subject: RE: [PERFORM] tsearch2 seem very slow

On Fri, 23 Sep 2005, Ahmad Fajar wrote: Hi Oleg, I don't deny that on the third repeat or later it can reach 600 msec. That's only because the result is still in the postgres cache — but what about the first run? I daren't rely on it; the values are unacceptable.
Because my table will grow rapidly (about 10 rows per week), and visitors will search for anything I can't predict — whether it's a repeated search or a new one, whether it's in the postgres cache or not.

If you have enough shared memory, postgresql will keep index pages there.

I just compare with http://www.postgresql.org: the search there is quite fast, and I don't know whether the site uses tsearch2 or something else. But as far as I know, once the rows reach 100 million (I have tried 200 million rows and it seems very slow), even without tsearch2 — with only a simple query like: select f1, f2 from table1 where f2='blabla', with f2 indexed — my postgres is still slow on the first run, about 10 sec. Because of this I tried something brand new to fulfill my needs. I have used fti, and tsearch2, but it is still slow. I don't know what's going wrong with my postgres, or what configuration I must use for the query to get fast results. Or must I use EnterpriseDB 2005 or Pervasive...
[PERFORM] tsearch2 seem very slow
I have about 419,804 rows in my article table. I have installed tsearch2 and its GiST index correctly. My table structure is:

CREATE TABLE tbarticles (
    articleid int4 NOT NULL,
    title varchar(250),
    mediaid int4,
    datee date,
    content text,
    contentvar text,
    mmcol float4 NOT NULL,
    sirkulasi float4,
    page varchar(10),
    tglisidata date,
    namapc varchar(12),
    usere varchar(12),
    file_pdf varchar(255),
    file_pdf2 varchar(50),
    kolom int4,
    size_jpeg int4,
    journalist varchar(120),
    ratebw float4,
    ratefc float4,
    fti tsvector,
    CONSTRAINT pk_tbarticles PRIMARY KEY (articleid)
) WITHOUT OIDS;

Create index fti_idx1 on tbarticles using gist (fti);
Create index fti_idx2 on tbarticles using gist (datee, fti);

But when I search something like:

Select articleid, title, datee from tbarticles where fti @@ to_tsquery('susilo&bambang&yudhoyono&jusuf&kalla');

it takes about 30 sec. I ran explain analyze and the index is used correctly. Then I tried a multicolumn index to filter by date, with a query something like:

Select articleid, title, datee from tbarticles where fti @@ to_tsquery('susilo&bambang&yudhoyono&jusuf&kalla') and datee >= '2002-01-01' and datee <= current_date;

and it still runs for about 25 sec. I ran explain analyze and my multicolumn index is used correctly. This is not acceptable: I can't publish my website if searches take this long. I have run vacuum full analyze before running such queries. What is going wrong with my query? Is there any way to make this faster? I have tried to tune my postgres configuration, but it seems hopeless. My Linux box is RedHat 4 AS, and the hardware: 2 GB RAM DDR 400, 2x200 GB Serial ATA 7200 RPM configured as RAID0 (just for the postgres data); my system runs on an 80 GB EIDE 7200 RPM disk. Please... help... help...
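For completeness: the tsearch2 contrib module also ships a trigger function for keeping the fti column up to date on INSERT/UPDATE, so the vector never goes stale between manual rebuilds. A sketch, assuming title and content are the columns feeding the vector (the thread doesn't say which columns the author indexes):

```sql
-- Keep fti in sync with the source text columns using the tsearch2()
-- trigger function shipped with the contrib module
CREATE TRIGGER tbarticles_fti_update
    BEFORE INSERT OR UPDATE ON tbarticles
    FOR EACH ROW
    EXECUTE PROCEDURE tsearch2(fti, title, content);
```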