Hello everyone,

I'm doing some web logs analysis. I decided to use a sqlite database
because the application has to be easily portable to another
workstation or server and there is absolutly no need of multi-clients.
Moreover, I don't want to spend my time dealing with the database.

What I want to do is quite simple. I want to parse the logs from a
high traffic apache website (loadbalanced on 8 front servers) and
integrated them in a database. That's the easy part. In apache
configuration I added the mod_usertrack in order to be able to follow
a user in the logs.

Then I want to create parent / child relations between requests from a
unique user (id by the mod_usertrack). The final goal is to see all
users. Then by user we can see the requests issued from a click on a
webpage (manual human interface --> parent). Finaly for each parent
requests (click) I want to have all sub requests issued from the
parent page (automatic fetch (images, css, javascript, ...)).

The global aim of all this is to analyse the behaviour of visitors on
our website and to extract different populations of users and make
statistics.

Here are the tables I use:

CREATE TABLE IF NOT EXISTS requests (
 id integer primary key autoincrement,
 date date not null,
 request varchar not null,
 method varchar not null,
 code integer not null,
 referer varchar,
 response_time integer not null,
 vhost varchar DEFAULT NULL,
 cookie not null REFERENCES cookies(id))

CREATE TABLE IF NOT EXISTS cookies (
 id integer primary key autoincrement,
 cookie varchar not null UNIQUE)

CREATE TABLE IF NOT EXISTS siblings (
 parent not null REFERENCES requests(id),
 child not null REFERENCES requests(id))

I ran a test on 2 hours logs from the 18th of february (from 10:00AM
to 11:59AM).

The first step is to parse the logs in order to populate the
`requests` and `cookies` tables. It took about 2min to parse and
insert 238304 cookies and 1163728 requests. After this step the
database is 317Mo sized.

Then I have to populate the `siblings` table. I use the following request:

INSERT INTO siblings (parent, child) SELECT a.id, b.id FROM requests a
INNER JOIN requests b ON a.cookie = b.cookie WHERE a.page = b.referer
AND b.date BETWEEN a.date AND a.date + 20

It took about 3hours and 30 minutes. At the end there is 7158001 rows
in the `siblings` table and the database is about 600Mo sized.

My questions:
1- Is there a better way to populate the `siblings` table ? (optimize
my sql request)
2- What can I do to optimize the all process ? (play with some PRAGMA
parameters maybe)
3- Is sqlite a good choice for this case ? Would mysql or postgresql a
better choice ?

Thanks in advance for your time.

++ Jerome
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to