[PERFORM] Indexing question

2004-06-20 Thread Stan Bielski
Hello all,

I have a rather large table (~20 GB) of network logs taken over the period
of a month and stored under postgres 7.3. I'm trying to create an indexing
scheme so that I can do a query with a where time > foo and time < bar
and get the results without too much waiting. With this in mind, I created
a new database ordering each log entry by the time values (seconds, and
microseconds), and then created an index on the time value for seconds. I
didn't use the clusterdb command, but I did do a select * into foo_tbl
order by time, which I understand to be the functional equivalent. The
result looks something like this: 

 Table "public.allflow_tv_mydoom"
   Column   |  Type   | Modifiers
------------+---------+-----------
 tv_s   | bigint  |
 tv_us  | bigint  |
 src| inet|
 dst| inet|
 sport  | integer |
 dport  | integer |
 bytes_in   | bigint  |
 bytes_out  | bigint  |
 isn_in | bigint  |
 isn_out| bigint  |
 num_starts | integer |
 result | integer |
 flags  | integer |
 age_s  | bigint  |
 age_us | bigint  |
 ttl_left   | bigint  |
 end_s  | bigint  |
 end_us | bigint  |
Indexes: allflow_tv_mydoom_x btree (tv_s) 

with tv_s, of course, being my value in seconds. I followed this up with
an ANALYZE so postgres could absorb this structure into its logic.
However, whenever I want to do my query, it *insists* on doing a
sequential scan, even when I explicitly turn sequential scans off in
postgresql.conf...  here's an example:

standb=# explain select * from allflow_tv_mydoom where tv_s > 1074200099
and tv_s < 107506499;
QUERY PLAN
---
 Seq Scan on allflow_tv_mydoom  (cost=1.00..102303307.94 rows=1
width=132)
   Filter: ((tv_s > 1074200099) AND (tv_s < 107506499))
(2 rows)  

In this query I'm basically asking for a day's worth of data. It should be
straightforward: it's all laid out in order and indexed, but it still
insists on (what I believe to be) tearing through the entire DB and
filtering out the other time values as it sees them. Even if I want just
the data from a particular second, it does the same thing:

standb=# explain select * from allflow_tv_mydoom where tv_s = 1074200099;
 QUERY PLAN
-
 Seq Scan on allflow_tv_mydoom  (cost=1.00..102140682.45 rows=145
width=132)
   Filter: (tv_s = 1074200099)
(2 rows) 

Naturally, this is incredibly frustrating because it takes forever,
regardless of the query.

The funny thing though is that I have the same data in another table
where it is ordered and indexed by IP address, and the queries use the
index and work the way I want them to. Example:

standb=# explain select * from flow_ip_mydoom where src = '10.0.5.5';
 QUERY PLAN
-
 Index Scan using flow_ip_mydoom_x on flow_ip_mydoom  (cost=0.00..333.41
rows=376 width=132)
   Index Cond: (src = '10.0.5.5'::inet)
(2 rows)

Can someone please explain to me what's going on and how to fix it? If
there's an easier way to 'jump' to different time regions in the table
without indexing all the different seconds values, I'd like to know this
as well. 

Thanks a bunch,
-S





---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Indexing question

2004-06-20 Thread Tom Lane
Stan Bielski [EMAIL PROTECTED] writes:
>  Table "public.allflow_tv_mydoom"
>    Column   |  Type   | Modifiers
> ------------+---------+-----------
>  tv_s       | bigint  |
                ^^
> Indexes: allflow_tv_mydoom_x btree (tv_s)

> standb=# explain select * from allflow_tv_mydoom where tv_s > 1074200099
> and tv_s < 107506499;
> [ gives seqscan ]

This is a FAQ :-(.  Unadorned integer constants are taken to be int4
not int8 (unless they are too large for int4), and cross-data-type
comparisons are not indexable in existing releases.  So you have to
explicitly cast the comparison values to int8:

explain select * from allflow_tv_mydoom where tv_s > 1074200099::bigint
and tv_s < 107506499::bigint;

(or use the standard CAST syntax if you prefer).
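
For reference, a sketch of the same query spelled with SQL-standard CAST
syntax (equivalent to the ::bigint form above):

```sql
-- Same query as above, using SQL-standard CAST instead of the
-- PostgreSQL-specific ::bigint notation.
explain select * from allflow_tv_mydoom
 where tv_s > CAST(1074200099 AS bigint)
   and tv_s < CAST(107506499 AS bigint);
```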

7.5 will have a fix for this ancient annoyance.

BTW, is there a reason to be using tv_s+tv_us and not just a single
timestamptz column?
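
For what it's worth, a minimal sketch of such a rework, assuming tv_s holds
Unix epoch seconds and tv_us the microsecond remainder (the index name here
is made up; to_timestamp(double precision) is the conversion function in
current releases):

```sql
-- Hypothetical migration: collapse tv_s/tv_us into one timestamptz column.
ALTER TABLE allflow_tv_mydoom ADD COLUMN tv timestamptz;
UPDATE allflow_tv_mydoom SET tv = to_timestamp(tv_s + tv_us / 1000000.0);
CREATE INDEX allflow_tv_idx ON allflow_tv_mydoom (tv);

-- Range queries then need no casts at all:
-- SELECT * FROM allflow_tv_mydoom
--  WHERE tv >= '2004-01-15' AND tv < '2004-01-16';
```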

regards, tom lane



Re: [PERFORM] Indexing question

2003-08-29 Thread Alexander Priem
So if I understand correctly I could ditch the 'deleted' field entirely and
use just the 'deleteddate' field. This 'deleteddate' field would be NULL by
default. It would contain a date value if the record is considered
'deleted'.

The index would be 'create index a on tablename(deleteddate) where
deleteddate is null'.

I could then access 'current' records with a view like 'create view x_view
as select * from tablename where deleteddate is null'.

Is that correct? This would be the best performing solution for this kind of
thing, I think (theoretically at least)?

Kind regards,
Alexander Priem.



- Original Message -
From: Tomasz Myrta [EMAIL PROTECTED]
To: Alexander Priem [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Friday, August 29, 2003 9:03 AM
Subject: Re: [PERFORM] Indexing question


> > Hi all,
> >
> > I have some tables (which can get pretty large) in which I want to
> > record 'current' data as well as 'historical' data. This table has
> > fields 'deleted' and 'deleteddate' (among other fields, of course). The
> > field 'deleted' is false by default. Every record that I want to delete
> > gets the value true for 'deleted' and 'deleteddate' is set to the date
> > of deletion.
> >
> > Since these tables are used a lot by queries that only use 'current'
> > data, I have created a view with a where clause 'Where not deleted'.
> > Also, I have indexed field 'deleted'.
>
> <cut>
> I think the best choice for your case is using conditional indexes. It
> should be much better than indexing the 'deleted' field. I don't know
> exactly which fields you have to create this index on - you have to check
> that yourself - what do you have in the where clause?
>
> Example:
> create index some_index on your_table(id_field) where not deleted;
>
> Regards,
> Tomasz Myrta




Re: [PERFORM] Indexing question

2003-08-29 Thread Tomasz Myrta
> So if I understand correctly I could ditch the 'deleted' field entirely and
> use just the 'deleteddate' field. This 'deleteddate' field would be NULL by
> default. It would contain a date value if the record is considered
> 'deleted'.
> The index would be 'create index a on tablename(deleteddate) where
> deleteddate is null'.
> I could then access 'current' records with a view like 'create view x_view
> as select * from tablename where deleteddate is null'.
> Is that correct? This would be the best performing solution for this kind of
> thing, I think (theoretically at least)?
> Kind regards,
> Alexander Priem.
Close, but not exactly. You're right that you don't need the 'deleted' field.

Your example:
create index a on tablename(deleteddate) where deleteddate is null
we can translate to:
create index a on tablename(NULL) where deleteddate is null
which doesn't make much sense.
Check your queries. You probably have something like this:
select * from tablename where not deleted and xxx
Create your index to match the xxx clause - if xxx is some_id=13, then
create your index as:
create index some_index on tablename(some_id) where deleteddate is null;

Regards,
Tomasz Myrta


Re: [PERFORM] Indexing question

2003-08-29 Thread Christopher Kings-Lynne

> So if I understand correctly I could ditch the 'deleted' field entirely
> and use just the 'deleteddate' field. This 'deleteddate' field would be
> NULL by default. It would contain a date value if the record is considered
> 'deleted'.
>
> The index would be 'create index a on tablename(deleteddate) where
> deleteddate is null'.
>
> I could then access 'current' records with a view like 'create view x_view
> as select * from tablename where deleteddate is null'.
>
> Is that correct? This would be the best performing solution for this kind
> of thing, I think (theoretically at least)?

Yes, I think it would be best.  Definitely better than your current
solution.

Cheers,

Chris




Re: [PERFORM] Indexing question

2003-08-29 Thread Tomasz Myrta
> create index orad_id_index on orderadvice (orad_id) where orad_deleteddate
> is null;
> create index orad_name_index on orderadvice (orad_name) where
> orad_deleteddate is null;
> create view orderadvice_edit as select
> orad_id,orad_name,orad_description,orad_value,orad_value_quan from
> orderadvice where orad_deleteddate is null;
> Would queries like 'select * from orderadvice_edit where orad_id=100' or
> 'select * from orderadvice_edit order by orad_name' both use one of these
> two partial indexes, given enough records are present in the table?
> There would be a double index on the primary key this way, right?
It looks much better now. I'm not sure about the second index. It will
probably be useless, because you would be sorting ALL records with
deleteddate is null. Maybe the first index will be enough.

I'm not sure what to do with the doubled index on the primary key field.

Regards,
Tomasz Myrta


Re: [PERFORM] Indexing question

2003-08-29 Thread Alexander Priem
Well, the intention is to hold every record that ever existed in the table.
Therefore, records do not get deleted, but they get a date in the
deleteddate field. This way, we can track what changes were made to the
table(s).

So if a record gets 'deleted', the field 'deleteddate' is set to today's date.
If a record gets 'updated', a new record is made containing the new data,
and the old record is marked as 'deleted'.

So the percentage of 'deleted' records will grow with time, if you
understand what I mean.



- Original Message -
From: Tom Lane [EMAIL PROTECTED]
To: Alexander Priem [EMAIL PROTECTED]
Cc: Tomasz Myrta [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Friday, August 29, 2003 4:00 PM
Subject: Re: [PERFORM] Indexing question


> Alexander Priem [EMAIL PROTECTED] writes:
> > Does anyone know whether it is bad practise to have two indexes on the
> > primary key of a table? (one 'primary key' index and one partial index)
>
> It's a little unusual, but if you get enough performance boost from it
> to justify the maintenance cost of the extra index, then I can't see
> anything wrong with it.
>
> The "if" is worth checking though.  I missed the start of this thread,
> but what percentage of your rows do you expect to have null deleteddate?
> Unless it's a pretty small percentage, I'm unconvinced that the extra
> indexes will be worth their cost.
>
> regards, tom lane




Re: [PERFORM] Indexing question

2003-08-29 Thread Richard Ellis
On Fri, Aug 29, 2003 at 05:13:52PM +0200, Alexander Priem wrote:
> Well, the intention is to hold every record that ever existed in the table.
> Therefore, records do not get deleted, but they get a date in the
> deleteddate field. This way, we can track what changes were made to the
> table(s).
>
> So if a record gets 'deleted', the field 'deleteddate' is set to today's
> date. If a record gets 'updated', a new record is made containing the new
> data, and the old record is marked as 'deleted'.
>
> So the percentage of 'deleted' records will grow with time, if you
> understand what I mean.

Did you consider a two-table implementation?  One table, live_table,
containing the non-deleted records, and a second table, deleted_table,
containing the deleted records along with the deleted_date field.  Keep
the two in sync column type/number wise, and use a before-delete trigger
function on live_table to insert a copy of the deleted row plus
deleted_date into deleted_table before performing the delete on
live_table.
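
A minimal sketch of that delete-trigger approach (the table and column
names here are illustrative, not from the original schema):

```sql
-- Hypothetical two-table layout: live rows in live_table, history in
-- deleted_table (same columns plus deleted_date).
CREATE TABLE live_table    (id integer PRIMARY KEY, payload text);
CREATE TABLE deleted_table (id integer, payload text,
                            deleted_date timestamptz);

CREATE FUNCTION archive_deleted_row() RETURNS trigger AS '
BEGIN
    -- Copy the outgoing row into the history table, stamped with now().
    INSERT INTO deleted_table (id, payload, deleted_date)
         VALUES (OLD.id, OLD.payload, now());
    RETURN OLD;  -- let the DELETE proceed
END;
' LANGUAGE plpgsql;

CREATE TRIGGER live_table_archive
    BEFORE DELETE ON live_table
    FOR EACH ROW EXECUTE PROCEDURE archive_deleted_row();
```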

You could also use a before update trigger to keep old copies of updated
records in the same way.

Then you would only incur the performance loss of scanning/etc. the deleted
records when you actually need to pull up deleted plus live records.

