[PERFORM] Indexing question
Hello all,

I have a rather large table (~20 GB) of network logs taken over the period of a month, stored under Postgres 7.3. I'm trying to create an indexing scheme so that I can do a query with "where time > foo and time < bar" and get the results without too much waiting. With this in mind, I created a new database ordering each log entry by the time values (seconds and microseconds), and then created an index on the time value for seconds. I didn't use the clusterdb command, but I did do a "select * into foo_tbl order by time", which I understand to be the functional equivalent. The result looks something like this:

        Table "public.allflow_tv_mydoom"
   Column   |  Type   | Modifiers
------------+---------+-----------
 tv_s       | bigint  |
 tv_us      | bigint  |
 src        | inet    |
 dst        | inet    |
 sport      | integer |
 dport      | integer |
 bytes_in   | bigint  |
 bytes_out  | bigint  |
 isn_in     | bigint  |
 isn_out    | bigint  |
 num_starts | integer |
 result     | integer |
 flags      | integer |
 age_s      | bigint  |
 age_us     | bigint  |
 ttl_left   | bigint  |
 end_s      | bigint  |
 end_us     | bigint  |
Indexes: allflow_tv_mydoom_x btree (tv_s)

with tv_s, of course, being my value in seconds. I followed this up with an ANALYZE so Postgres could absorb this structure into its logic. However, whenever I run my query, it *insists* on doing a sequential scan, even when I explicitly turn sequential scan off in postgresql.conf. Here's an example:

standb=# explain select * from allflow_tv_mydoom where tv_s > 1074200099 and tv_s < 107506499;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Seq Scan on allflow_tv_mydoom  (cost=1.00..102303307.94 rows=1 width=132)
   Filter: ((tv_s > 1074200099) AND (tv_s < 107506499))
(2 rows)

In this query I'm basically asking for a day's worth of data. It should be straightforward: it's all laid out in order and indexed, but it still insists on (what I believe to be) tearing through the entire DB and filtering out the other time values as it sees them.
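For reference, the clusterdb route I skipped would look something like this (PostgreSQL 7.3 syntax, using the table and index names above); it reorders the table on disk around an existing index rather than copying into a new table:

```sql
-- Physically reorder the table around the tv_s index (7.3 syntax),
-- then re-gather statistics for the planner:
CLUSTER allflow_tv_mydoom_x ON allflow_tv_mydoom;
ANALYZE allflow_tv_mydoom;
```

Unlike "select * into ... order by", CLUSTER keeps the existing indexes on the table, so no re-creation step is needed afterwards.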
Even if I want just the data from a particular second, it does the same thing:

standb=# explain select * from allflow_tv_mydoom where tv_s = 1074200099;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Seq Scan on allflow_tv_mydoom  (cost=1.00..102140682.45 rows=145 width=132)
   Filter: (tv_s = 1074200099)
(2 rows)

Naturally, this is incredibly frustrating because it takes forever, regardless of the query. The funny thing, though, is that I have the same data in another table where it is ordered and indexed by IP address, and the queries use the index and work the way I want them to. Example:

standb=# explain select * from flow_ip_mydoom where src = '10.0.5.5';
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Index Scan using flow_ip_mydoom_x on flow_ip_mydoom  (cost=0.00..333.41 rows=376 width=132)
   Index Cond: (src = '10.0.5.5'::inet)
(2 rows)

Can someone please explain to me what's going on and how to fix it? If there's an easier way to 'jump' to different time regions in the table without indexing all the different seconds values, I'd like to know this as well.

Thanks a bunch,
-S

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] Indexing question
Stan Bielski [EMAIL PROTECTED] writes:
>         Table "public.allflow_tv_mydoom"
>  Column | Type   | Modifiers
> --------+--------+-----------
>  tv_s   | bigint |
>           ^^^^^^
> Indexes: allflow_tv_mydoom_x btree (tv_s)

> standb=# explain select * from allflow_tv_mydoom where tv_s > 1074200099 and tv_s < 107506499;
> [ gives seqscan ]

This is a FAQ :-(. Unadorned integer constants are taken to be int4 not int8 (unless they are too large for int4), and cross-data-type comparisons are not indexable in existing releases. So you have to explicitly cast the comparison values to int8:

explain select * from allflow_tv_mydoom
  where tv_s > 1074200099::bigint and tv_s < 107506499::bigint;

(or use the standard CAST syntax if you prefer). 7.5 will have a fix for this ancient annoyance.

BTW, is there a reason to be using tv_s+tv_us and not just a single timestamptz column?

regards, tom lane
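Both forms of the cast Tom describes, plus the single-column alternative he hints at, might look like this (table and column names are from the thread; the epoch-conversion expression is illustrative, not from the original messages):

```sql
-- Explicit casts so the int8 (bigint) column can use its btree index (pre-7.5):
EXPLAIN SELECT * FROM allflow_tv_mydoom
 WHERE tv_s > 1074200099::bigint
   AND tv_s < 107506499::bigint;

-- Equivalent, using standard CAST syntax:
EXPLAIN SELECT * FROM allflow_tv_mydoom
 WHERE tv_s > CAST(1074200099 AS bigint)
   AND tv_s < CAST(107506499 AS bigint);

-- Hypothetical migration to a single timestamptz column, assuming
-- tv_s/tv_us hold Unix epoch seconds and microseconds:
ALTER TABLE allflow_tv_mydoom ADD COLUMN tv timestamptz;
UPDATE allflow_tv_mydoom
   SET tv = timestamptz 'epoch'
          + (tv_s + tv_us / 1000000.0) * interval '1 second';
```

With a timestamptz column, range predicates take timestamp literals directly and the int4-vs-int8 mismatch never arises.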
Re: [PERFORM] Indexing question
So if I understand correctly, I could ditch the 'deleted' field entirely and use just the 'deleteddate' field. This 'deleteddate' field would be NULL by default. It would contain a date value if the record is considered 'deleted'.

The index would be 'create index a on tablename(deleteddate) where deleteddate is null'. I could then access 'current' records with a view like 'create view x_view as select * from tablename where deleteddate is null'.

Is that correct? This would be the best performing solution for this kind of thing, I think (theoretically at least)?

Kind regards,
Alexander Priem.

----- Original Message -----
From: Tomasz Myrta [EMAIL PROTECTED]
To: Alexander Priem [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Friday, August 29, 2003 9:03 AM
Subject: Re: [PERFORM] Indexing question

> Hi all,
> I have some tables (which can get pretty large) in which I want to record
> 'current' data as well as 'historical' data. This table has fields
> 'deleted' and 'deleteddate' (among other fields, of course). The field
> 'deleted' is false by default. Every record that I want to delete gets the
> value true for 'deleted', and 'deleteddate' is set to the date of deletion.
> Since these tables are used a lot by queries that only use 'current' data,
> I have created a view with a where clause 'Where not deleted'. Also, I have
> indexed field 'deleted'.
<cut>
I think the best choice for your case is using conditional indexes. It should be much better than indexing the 'deleted' field. I don't know exactly which fields you have to create this index on - you have to check it yourself - what do you have in the where clause?

Example:
create index some_index on your_table(id_field) where not deleted;

Regards,
Tomasz Myrta
Re: [PERFORM] Indexing question
> So if I understand correctly, I could ditch the 'deleted' field entirely
> and use just the 'deleteddate' field. This 'deleteddate' field would be
> NULL by default. It would contain a date value if the record is considered
> 'deleted'. The index would be 'create index a on tablename(deleteddate)
> where deleteddate is null'. I could then access 'current' records with a
> view like 'create view x_view as select * from tablename where deleteddate
> is null'. Is that correct? This would be the best performing solution for
> this kind of thing, I think (theoretically at least)?

Near, but not exactly. You don't need the field 'deleted' - that's true. But your example:
  create index a on tablename(deleteddate) where deleteddate is null
we can translate to:
  create index a on tablename(NULL) where deleteddate is null
which doesn't make too much sense. Check your queries. You probably have something like this:
  select * from tablename where not deleted and xxx
Create your index to match the xxx clause - if xxx is some_id=13, then create your index as:
  create index on tablename(some_id) where deleteddate is null;

Regards,
Tomasz Myrta
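Tomasz's advice sketched end to end (the table layout and column names here are hypothetical, chosen to match the placeholders in the message):

```sql
-- Hypothetical schema: deleteddate IS NULL marks a 'current' row.
CREATE TABLE tablename (
    some_id     integer NOT NULL,
    payload     text,
    deleteddate date
);

-- Partial index covering only current rows, keyed on the column
-- that actually appears in the queries' WHERE clauses:
CREATE INDEX tablename_some_id_current
    ON tablename (some_id)
    WHERE deleteddate IS NULL;

-- View exposing only current rows:
CREATE VIEW x_view AS
    SELECT * FROM tablename WHERE deleteddate IS NULL;

-- This query matches the index predicate and key, so it can use
-- the small partial index instead of scanning historical rows:
SELECT * FROM x_view WHERE some_id = 13;
```

The point is that the partial index's key column comes from the query's other conditions; the deleteddate test belongs in the WHERE clause of the index, not in its key.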
Re: [PERFORM] Indexing question
> So if I understand correctly, I could ditch the 'deleted' field entirely
> and use just the 'deleteddate' field. This 'deleteddate' field would be
> NULL by default. It would contain a date value if the record is considered
> 'deleted'. The index would be 'create index a on tablename(deleteddate)
> where deleteddate is null'. I could then access 'current' records with a
> view like 'create view x_view as select * from tablename where deleteddate
> is null'. Is that correct? This would be the best performing solution for
> this kind of thing, I think (theoretically at least)?

Yes, I think it would be best. Definitely better than your current solution.

Cheers,
Chris
Re: [PERFORM] Indexing question
> create index orad_id_index on orderadvice (orad_id)
>   where orad_deleteddate is null;
> create index orad_name_index on orderadvice (orad_name)
>   where orad_deleteddate is null;
> create view orderadvice_edit as
>   select orad_id,orad_name,orad_description,orad_value,orad_value_quan
>   from orderadvice where orad_deleteddate is null;
>
> Would queries like 'select * from orderadvice_edit where orad_id=100' or
> 'select * from orderadvice_edit order by orad_name' both use one of these
> two partial indexes, given enough records are present in the table? There
> would be a double index on the primary key this way, right?

It looks much better now. I'm not sure about the second index. Probably it will be useless, because you sort ALL records with deleteddate is null. Maybe the first index will be enough. I'm not sure what to do with the doubled index on a primary key field.

Regards,
Tomasz Myrta
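Per the list's standing tip, the question above can be settled empirically rather than by guessing; a sketch, using the names from the message against a populated table:

```sql
-- Run each query under EXPLAIN ANALYZE and read the plan:
-- an "Index Scan using orad_id_index ..." line means the partial
-- index is being used; a "Seq Scan on orderadvice" means it is not.
EXPLAIN ANALYZE SELECT * FROM orderadvice_edit WHERE orad_id = 100;
EXPLAIN ANALYZE SELECT * FROM orderadvice_edit ORDER BY orad_name;
```

Note that the planner's choice depends on statistics, so the table should be ANALYZEd and contain a realistic volume of data before the comparison means anything.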
Re: [PERFORM] Indexing question
Well, the intention is to hold every record that ever existed in the table. Therefore, records do not get deleted; instead they get a date in the 'deleteddate' field. This way, we can track what changes were made to the table(s).

So if a record gets 'deleted', the field 'deleteddate' is set to today's date. If a record gets 'updated', a new record is made containing the new data, and the old record is marked as 'deleted'. So the percentage of 'deleted' records will grow with time, if you understand what I mean.

----- Original Message -----
From: Tom Lane [EMAIL PROTECTED]
To: Alexander Priem [EMAIL PROTECTED]
Cc: Tomasz Myrta [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Friday, August 29, 2003 4:00 PM
Subject: Re: [PERFORM] Indexing question

Alexander Priem [EMAIL PROTECTED] writes:
> Does anyone know whether it is bad practise to have two indexes on the
> primary key of a table? (one 'primary key' index and one partial index)

It's a little unusual, but if you get enough performance boost from it to justify the maintenance cost of the extra index, then I can't see anything wrong with it. The "if" is worth checking though. I missed the start of this thread, but what percentage of your rows do you expect to have null deleteddate? Unless it's a pretty small percentage, I'm unconvinced that the extra indexes will be worth their cost.

regards, tom lane
Re: [PERFORM] Indexing question
On Fri, Aug 29, 2003 at 05:13:52PM +0200, Alexander Priem wrote:
> Well, the intention is to hold every record that ever existed in the
> table. Therefore, records do not get deleted, but they get a date in the
> deleteddate field. This way, we can track what changes were made to the
> table(s). So if a record gets 'deleted', the field 'deleteddate' is set
> to today's date. If a record gets 'updated', a new record is made
> containing the new data, and the old record is marked as 'deleted'. So
> the percentage of 'deleted' records will grow with time, if you
> understand what I mean.

Did you consider a two-table implementation? One table, live_table, containing the non-deleted records, and a second table, deleted_table, containing the deleted records along with the deleted_date field. Keep the two in sync column type/number wise, and use a before-delete trigger function on live_table to insert a copy of the deleted row plus deleted_date into deleted_table before performing the delete on live_table. You could also use a before-update trigger to keep old copies of updated records in the same way.

Then you would only incur the performance cost of scanning the deleted records when you actually need to pull up deleted plus live records.
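A minimal sketch of the two-table scheme described above (table names from the message; the column layout and PL/pgSQL trigger body are illustrative guesses, written in the quoted-string function syntax of the 7.x era):

```sql
-- Live rows only:
CREATE TABLE live_table (
    id      integer PRIMARY KEY,
    payload text
);

-- Same columns as live_table, plus the deletion date:
CREATE TABLE deleted_table (
    id           integer,
    payload      text,
    deleted_date date
);

-- Archive a row into deleted_table before it is removed:
CREATE FUNCTION archive_deleted_row() RETURNS trigger AS '
BEGIN
    INSERT INTO deleted_table (id, payload, deleted_date)
        VALUES (OLD.id, OLD.payload, current_date);
    RETURN OLD;  -- returning OLD lets the DELETE proceed
END;
' LANGUAGE plpgsql;

CREATE TRIGGER live_table_archive
    BEFORE DELETE ON live_table
    FOR EACH ROW EXECUTE PROCEDURE archive_deleted_row();
```

With this in place, ordinary queries against live_table never touch historical rows at all, and no partial indexes are needed; the cost is that "live plus deleted" queries become a UNION over the two tables.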