[PERFORM] Speeding up this function

2004-10-19 Thread Matt Nuzum
Hello, I've thought it would be nice to index certain aspects of my
apache log files for analysis. I've used several different techniques
and have something usable now, but I'd like to tweak it one step
further.

My first performance optimization was to change the logformat into a
CSV format.  I processed the logfiles with PHP and plsql stored
procedures. Unfortunately, it took more than 24 hours to process 1
days worth of log files.

I've now switched to using C# (using mono) to create hash-tables to do
almost all of the pre-processing.  This has brought the time down to
about 3 hours.  Actually, if I take out one step it brought the
process down to about 6 minutes, which is a tremendous improvement.

The one step that is adding 2.5+ hours to the job is not easily done
in C#, as far as I know.

Once the mostly-normalized data has been put into a table called
usage_raw_access I then use this query:
insert into usage_access select * , 
usage_normalize_session(accountid,client,atime) as sessionid 
from usage_raw_access;

All it does is try to link pageviews together into a session. 
here's the function:
 create or replace function usage_normalize_session (varchar(12),
inet, timestamptz) returns integer as '
 DECLARE
 -- $1 = Account ID, $2 = IP Address, $3 = Time
 RecordSet record;
 BEGIN
 SELECT INTO RecordSet DISTINCT sessionid FROM usage_access ua
 WHERE ua.accountid = $1
 AND ua.client = $2
 AND ua.atime = ($3 - ''20 min''::interval)::timestamptz;

 if found
 then return RecordSet.sessionid;
 end if;

 return nextval(''usage_session_ids'');
 END;'
 language plpgsql;

And the table usage_access looks like this:
   Table public.usage_access
  Column|   Type   | Modifiers
-+--+---
[snip]
client  | inet |
atime   | timestamp with time zone |
accountid   | character varying(12)|
sessionid   | integer  |
Indexes: usage_acccess_req_url btree (req_url),
usage_access_accountid btree (accountid),
usage_access_atime btree (atime),
usage_access_hostid btree (hostid),
usage_access_sessionid btree (sessionid)
usage_access_sessionlookup btree (accountid,client,atime);

As you can see, this looks for clients who have visited the same site
within 20 min.  If there is no match, a unique sessionid is assigned
from a sequence.  If there is a visit, the session id assigned to them
is used.  I'm only able to process about 25 records per second with my
setup.  My window to do this job is 3-4 hours and the shorter the
better.

Here is an explain analyze of the query I do (note I limited it to 1000):
EXPLAIN ANALYZE
insert into usage_access select * ,
usage_normalize_session(accountid,client,atime) as sessionid from
usage_raw_access limit 1000;
QUERY PLAN
--
 Subquery Scan *SELECT*  (cost=0.00..20.00 rows=1000 width=196)
(actual time=51.63..47634.22 rows=1000 loops=1)
   -  Limit  (cost=0.00..20.00 rows=1000 width=196) (actual
time=51.59..47610.23 rows=1000 loops=1)
 -  Seq Scan on usage_raw_access  (cost=0.00..20.00 rows=1000
width=196) (actual time=51.58..47606.14 rows=1001 loops=1)
 Total runtime: 48980.54 msec

I also did an explain of the query that's performed inside the function:

EXPLAIN ANALYZE
select sessionid from usage_access ua where ua.accountid =  'XYZ' and
ua.client = '64.68.88.45'::inet and ua.atime = '2003-11-02
04:50:01-05'::timestamptz;
  
  QUERY PLAN
--
Index Scan using usage_access_sessionlookup on usage_access ua 
(cost=0.00..6.02 rows=1 width=4) (actual time=0.29..0.29 rows=0
loops=1)
  Index Cond: ((accountid = 'XYZ'::character varying) AND (client =
'64.68.88.45'::inet) AND (atime = '2003-11-02 04:50:01-05'::timestamp
with time zone))
Total runtime: 0.35 msec
(3 rows)


What I'd really like to know is if someone knows a way to do any of
the following:
 a: Make the INSERT into ... SELECT *,usage_access_sessionlookup().. work faster
 b: Make the usage_access_sessionlookup() smarter,better,etc.
 c: Do this in C# using a hash-table or some other procedure that
would be quicker.
 d: Find an algorithm to create the sessionid without having to do any
database or hash-table lookups.  As the dataset gets bigger, it won't
fit in RAM and the lookup queries will become I/O bound, drastically
slowing things down.

d: is my first choice.

For some reason I just can't seem to get my mind around the data.  I
wonder if there's someway 

Re: [PERFORM] Speeding up this function

2004-10-19 Thread Jeremy Dunn
 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of 
 Matt Nuzum
 Sent: Tuesday, October 19, 2004 3:35 PM
 To: pgsql-performance
 Subject: [PERFORM] Speeding up this function
 
snip
 
 All it does is try to link pageviews together into a session. 
 here's the function:
  create or replace function usage_normalize_session 
 (varchar(12), inet, timestamptz) returns integer as '  DECLARE
  -- $1 = Account ID, $2 = IP Address, $3 = Time
  RecordSet record;
  BEGIN
  SELECT INTO RecordSet DISTINCT sessionid FROM usage_access ua
  WHERE ua.accountid = $1
  AND ua.client = $2
  AND ua.atime = ($3 - ''20 
 min''::interval)::timestamptz;
 
  if found
  then return RecordSet.sessionid;
  end if;
 
  return nextval(''usage_session_ids'');
  END;'
  language plpgsql;
 

This is probably a stupid question, but why are you trying to create
sessions after the fact?  Since it appears that users of your site must
login, why not just assign a sessionID to them at login time, and keep
it in the URL for the duration of the session?  Then it would be easy to
track where they've been.

- Jeremy


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Speeding up this function

2004-10-19 Thread Matt Nuzum
On Tue, 19 Oct 2004 15:49:45 -0400, Jeremy Dunn [EMAIL PROTECTED] wrote:
  -Original Message-
  From: [EMAIL PROTECTED]
  [mailto:[EMAIL PROTECTED] On Behalf Of
  Matt Nuzum
  Sent: Tuesday, October 19, 2004 3:35 PM
  To: pgsql-performance
  Subject: [PERFORM] Speeding up this function
 
 snip
snip
 
 This is probably a stupid question, but why are you trying to create
 sessions after the fact?  Since it appears that users of your site must
 login, why not just assign a sessionID to them at login time, and keep
 it in the URL for the duration of the session?  Then it would be easy to
 track where they've been.
 
 - Jeremy
 
 

You don't have to log in to visit the sites.  These log files are
actually for many domains.  Right now, we do logging with a web-bug
and it does handle the sessions, but it relies on javascript and we
want to track a lot more than we are now.  Plus, that code is in
JavaScript and one of our primary motiviations is to ditch MySQL
completely.

-- 
Matthew Nuzum   | Makers of Elite Content Management System
www.followers.net   | View samples of Elite CMS in action
[EMAIL PROTECTED]   | http://www.followers.net/portfolio/

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly